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C6x Assembly Programming 
This module has details of assembly programming on the TI C6000 family of processors. 


Introduction 


This module contains details on how to program the TI C6000 family of processors in assembly. 
The C6000 family of processors has many variants. Therefore, it would not be possible to describe 
how to program all the processors here. However, the basic architecture and instructions are similar 
from one processor to another. They differ by the number of registers, the size of the registers, 
peripherals on the device, etc. This module will assume a device that has 32 general-purpose 32-bit 
registers and eight functional units, like the C6713 processor. 


References 


e¢ SPRU198: TMS320C6000 Programmer's Guide 
e SPRU186: TMS320C6000 Assembly Language Tools User's Guide 
e¢ SPRU733: TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide 


Overview of C6000 Architecture 


The C6000 consists of internal memory, peripherals (serial port, external memory interface, etc), 
and most importantly, the CPU that has the registers and the functional units for execution of 
instructions. Although you don't need to care about the internal architecture of the CPU for 
compiling and running programs, it is necessary to understand how the CPU fetches and executes 
the assembly instructions to write a highly optimized assembly program. 


Core DSP Operation 


In many DSP algorithms, the Sum of Products or Multiply-Accumulate (MAC) operations are very 
common. A DSP CPU is designed to handle the math-intensive calculations necessary for common 
DSP algorithms. For efficient implementation of the MAC operation, the C6000 CPU has two 
multipliers and each of them can perform a 16-bit multiplication in each clock cycle. For example, 
if we want to compute the dot product of two length-40 vectors a[n] and x[n], we need to compute: 
Equation: 


(For example, the FIR filtering algorithm is exactly the same as this dot product operation.) When 
an a[n] and x[n] are stored in memory, starting from n=1, we need to compute a[n]x[n] and add it to 
y (y is initially 0) and repeat this up to n=40. In the C6000 assembly, this MAC operation can be 
written as: 


MPY .M a,x,prod 
ADD .L y,prod,y 


Ignore .Mand .L for now. Here, a, X, prod and y are numbers stored in memory and the 
instruction MPY multiplies two numbers a and x together and stores the result in prod. The ADD 
instruction adds two numbers y and prod together storing the result back to y. 


Instructions 


Below is the structure of a line of assembly code. 


Parallel 


Label: bars (I) 


[Condition] Instruction Unit Operands ;Comments 


Labels identify a line of code or a variable and represent a memory address that contains either an 
instruction or data. The first character of a label must be must be in the first column and must be a 
letter or an underscore (_) followed by a letter. Labels can include up to 32 alphanumeric 
characters. 


An instruction that executes in parallel with the previous instruction signifies this with parallel 
bars (||). This field is left blank for an instruction that does not execute in parallel with the previous 
instruction. 


Every instruction in the C6x can execute conditionally. There are five registers available for 
conditions (on the C67x processors): Al, A2, BO, B1, and B2. If blank, the instruction always 
executes. Conditions can take a form such as [A1] where the instruction will execute if A1 is not 
zero. This can be handy for making loops were the counter is put in a register like A1 and is 
counted down to zero. The condition is put on the branch instruction that branches back to the 
beginning of the loop. 


Register Files 


Where are the numbers stored in the CPU? In the C6000, the numbers used in operations are stored 
in register. Because the registers are directly accessible though the data path of the CPU, accessing 
the registers is much faster than accessing data in the external memory. 


The C6000 CPU has two register files (A and B). Each of these files consists of sixteen 32-bit 
registers (AO-A15 for file A and BO-B15 for file B). The general-purpose registers can be used for 
data, data address pointers, or condition registers. The following figure shows a block diagram of 
the C67x processors. This basic structure is similar to other processors in the C6000 family. 
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TMS320C67x DSP Block Diagram taken from SPRU733: 
TMS320C67x/C67x+ DSP CPU and Instruction Set 
Reference Guide 


The general-purpose register files support data ranging in size from 16-bit data through 40-bit 
fixed-point. Values larger than 32 bits, such as 40-bit long quantities, are stored in register pairs. In 
a register pair, the 32 LSB's of data are placed in an even-numbered register and the remaining 8 
MSB's in the next upper register (which is always an odd-numbered register). In assembly language 
syntax, a colon between two register names denotes the register pairs, and the odd-numbered 
register is specified first. For example, A1:A0 represents the register pair consisting of AO and A1. 


Let's for now focus on file A only. The registers in the register file A are named AO to A15. Each 
register can store a 32-bit binary number. Then numbers such as a, X, prod and y above are stored 
in these registers. For example, register AO stores a. For now, let's assume we interpret all 32-bit 
numbers stored in registers as unsigned integer. Therefore the range of values we can represent is 0 
to 232-1. Let's assume the numbers a, x, prod and y are in the registers AO, Al, A3, A4, 
respectively. Then the above assembly instructions can be written specifically: 


MPY .M1 AQ,A1,A3 
ADD .L1 A4,A3,A4 


The TI C6000 CPU has a load/store architecture. This means that all the numbers must be stored in 
the registers before being used as operands of the operations for instructions such as MPY and ADD. 
The numbers can be read from a memory location to a register (using, for example, LDW, LDB 
instructions) or a register can be loaded with a constant value. The content of a register can be 
stored to a memory location (using, for example, STW, STB instructions). 


In addition to the general-purpose register files, the CPU has a separate register file for the control 
registers. The control registers are used to control various CPU functions such as addressing mode, 
interrupts, etc. 


Functional Units 


Where do the actual operations such as multiplication and addition take place? The C6000 CPU has 
several functional units that perform the actual operations. Each register file has 4 functional units 
named .M, .L, .S, and .D. The 4 functional units connected to the register file A are named . M1, 
.L1, .S1, and .D1. Those connected to the register file B are named .M2, .L2, .S2, and .D2. 
For example, the functional unit .M1 performs multiplication on the operands that are in register 
file A. When the CPU executes the MPY .M1 AO, A1, A3 above, the functional unit .M1 
takes the value stored in AO and A1, multiply them together and stores the result to A3. The .M1 in 
MPY .M1 AO, A1, AS indicates that this operation is performed in the .M1 unit. The .M1 unit 
has a 16 bit multiplier and all the multiplications are performed by the .M1 (or .M2) unit. The 
following diagram shows the basic architecture of the C6000 family and functional units. 
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Functional Units of the 'C6x 
taken from SPRU198: 
TMS320C6000 Programmers’ 
Guide 


Similarly, the ADD operation can be executed by the . L1 unit. The . L1 can perform all the logical 
operations such as bitwise AND operation (AND instruction) as well as basic addition (ADD 
instruction) and subtraction (SUB instruction). 

Exercise: 


Problem: 


Read the description of the ADD and MPY instructions in SPRU733 or similar document for 
the processor you are using. Write an assembly program that computes AO* (A1+A2)+A3. 


Typical Assembly Operations 


Loading constants to registers 


Quite often you need to load a register with a constant. The C6x instructions you can use for this 
task are MVK, MVKL, and MVKH. Each of these instructions can load a 16-bit constant to a register. 
The MVKL instruction loads the LOWER 16-bits and the MVKH instruction loads the HIGH 16-bits 
into the register. In order to load 32-bit values into a register, both instructions are needed. 
Exercise: 


Problem: (Loading constants): Write assembly instructions to do the following: 


1. Load the 16-bit constant Oxf 12 to A1. 
2. Load the 32-bit constant Oxabcd45ef to BO. 


Register moves, zeroing 


Contents of one register can be copied to another register by using the MV instruction. There is also 
the ZERO instruction to set a register to zero. 


Loading from memory to registers 


Because the C6x processor has the so-called load/store architecture, you must first load up the 
content of memory to a register to be able to manipulate it. The basic assembly instructions you use 
for loading are LDB, LDH, and LDwW for loading up 8-, 16-, and 32-bit data from memory. (There are 
some variations to these instructions for different handling of the signs of the loaded values.) 


However, to specify the address of the memory location to load from, you need to load up another 
register (used as an address index) and you can use various addressing modes to specify the 
memory locations in many different ways. The addressing mode is the method by which an 
instruction calculates the location of an object in memory. The table below lists all the possible 
different ways to handle the address pointers in the C6x CPU. Note the similarity with the C 
pointer manipulation. 


Syntax Memory address accessed Pointer modification 


*R R None 


Syntax 


*++R 


*R++ 


*R- = 


*+R[ disp] 


*-R[ disp] 


*++R[ disp ] 


*--R[ disp] 


*R++[disp ] 


*R--[disp] 


C6x addressing modes. 


Memory address accessed 


R+disp 


R+disp 


R+disp 


R+disp 


Pointer modification 


Preincrement 


Predecrement 


Postincrement 


Postdecrement 


None 


None 


Preincrement 


Predecrement 


Postincrement 


Postdecrement 


The [disp] specifies the number of elements in word, halfword, or byte, depending on the 
instruction type and it can be either 5-bit constant or a register. The increment/decrement of the 
index registers are also in terms of the number of bytes in word, halfword or byte. The addressing 
modes with displacements are useful when a block of memory locations is accessed. Those with 
automatic increment/decrement are useful when a block is accessed consecutively to implement a 
buffer, for example, to store signal samples to implement a digital filter. 


Exercise: 


Problem: 
(Load from memory): Assume the following values are stored in memory addresses: 


Loc 32-bit value 
100h fe54 7834h 
104h 3459 Ff34dh 
108h 2ef5 7eed4h 
10ch 2345 6789h 
110h ffff eeddh 
114h 3456 787eh 
118h 3f4d 7ab3h 


Suppose A10 = 0000 ©108h. Find the contents of Ai and A10 after executing the each of 
the following instructions. 


.LDW .D1 *A10, At 
LDH .D1 *A10, At 
LDB .D1 *A10, At 
LDW .D1 *-A10[1], A1 
LDW .D1 *+A10[1], A1 
LDW .D1 *+A10[2], A1 
LDB .D1 *+A10[2], A1 
-LDW .D1 *++A10[1], Ad 
. epi Sasol 


ONAUAWNE 


ico) 


10.LDB .D1 *++A10[1], At 
11, DBRS D Secret 
12. LDW .D1 *A10++[1], Al 
13. EDWES DA Ato SS aera 


Storing data to memory 


Storing the register contents uses the same addressing modes. The assembly instructions used for 
storing are STB, STH, and STW. 
Exercise: 


Problem: 


(Storing to memory): Write assembly instructions to store 32-bit constant 53fe 23e4h to 
memory address 0000 ©123h. 


Sometimes, it becomes necessary to access part of the data stored in memory. For example, if you 
store the 32-bit word 0X11223344 at memory location 0X8000, the four bytes having addresses 
location 0X8000, location 0X8001, location 0X8002, and location 0X8003 contain the value 


0x11223344. Then, if I read the byte data at memory location 9X8000, what would be the byte 
value to be read? 


The answer depends on the endian mode of the memory system. In the little endian mode, the 
lower memory addresses contain the LSB part of the data. Thus, the bytes stored in the four byte 
addresses will be as shown in [link]. 


0x8000 0x44 
Ox8001 0x33 
0x8002 0x22 
0x8003 0x11 


Little endian storage mode. 


In the big endian mode, the lower memory addresses contain the MSB part of the data. Thus, we 
have 


0x8000 0x11 


Ox8001 0x22 


0x8002 0x33 


0x8003 0x44 


Big endian storage mode. 


In the C6x CPU, it takes exactly one CPU clock cycle to execute each instruction. However, the 
instructions such as LDW need to access the slow external memory and the results of the load are 
not available immediately at the end of the execution. This delay of the execution results is called 
delay slots. 


Example: 

For example, let's consider loading up the content of memory content at address pointed by A10 to 
A11 and then moving the loaded data to A2. You might be tempted to write simple 2 line assembly 
code as follows: 


1 LDW D1 *A10;. AL 
2 MV D1 Ai, A2 


What is wrong with the above code? The result of the LDW instruction is not available immediately 
after LDW is executed. As a consequence, the MV instruction does not copy the desired value of A1 
to A2. To prevent this undesirable execution, we need to make the CPU wait until the result of the 
LDW instruction is correctly loaded to A1 before executing the MV instruction. For load 
instructions, we need extra 4 clock cycles until the load results are valid. To make the CPU wait 
for 4 clock cycles, we need to insert 4 NOP (no operations) instructions between LDW and MV. 
Each NOP instruction makes the CPU idle for one clock cycle. The resulting code will be like this: 


LDW D1 AIO; AL 
NOP 

NOP 

NOP 

NOP 

MV D1 Ai, A2 


OnRWNE 


or simply you can write 


LDW D1 *A10, Al 
NOP 4 
MV D1 A1,A2 


WONE 


Then, why didn't the designer of the CPU make such that LDW instruction takes 5 clock cycles to 
begin with, rather than let the programmer insert 4 NOPS? The answer is that you can insert other 
instructions other than NOPs as far as those instructions do not use the result of the LDW instruction 
above. By doing this, the CPU can execute additional instructions while waiting for the result of 
the LDW instruction to be valid, greatly reducing the total execution time of the entire program. 


Delay slots 


In the C6x CPU, it takes exactly one CPU clock cycle to execute each instruction. However, the 
instructions such as LDW need to access the slow external memory and the results of the load are 
not available immediately at the end of the execution. This delay of the execution results is called 
delay slots. 


Example: 

For example, let's consider loading up the content of memory content at address pointed by A10 to 
A1 and then moving the loaded data to A2. You might be tempted to write simple 2 line assembly 
code as follows: 


1 LDW D1 A160, AL 
2 MV D1 Ai, A2 


What is wrong with the above code? The result of the LDW instruction is not available immediately 
after LDW is executed. As a consequence, the MV instruction does not copy the desired value of A1 
to A2. To prevent this undesirable execution, we need to make the CPU wait until the result of the 
LDW instruction is correctly loaded to Ai before executing the MV instruction. For load 
instructions, we need extra 4 clock cycles until the load results are valid. To make the CPU wait 
for 4 clock cycles, we need to insert 4 NOP (no operations) instructions between LDW and MV. 
Each NOP instruction makes the CPU idle for one clock cycle. The resulting code will be like this: 


1 LDW D1 *A10, Al 
2 NOP 

3 NOP 

4 NOP 

5 NOP 

6 MV DE A1,A2 

or simply you can write 
1 LDW D1 *A10, Al 


2 NOP 4 


3 MV D1 Ai, A2 


Why didn't the designer of the CPU make such that LDW instruction takes 5 clock cycles to begin 
with, rather than let the programmer insert 4 NOPS? The answer is that you can insert other 
instructions other than NOPs as far as those instructions do not use the result of the LDW instruction 
above. By doing this, the CPU can execute additional instructions while waiting for the result of 
the LDW instruction to be valid, greatly reducing the total execution time of the entire program. 


Description Instructions Delay slots 
Single Cycle All instructions except following 0 
Multiply MPY, SMPY 1 
etc. 
Load LDB, LDH, LDW 4 
Branch B 5 
Delay slots 


The functional unit latency indicates how many clock cycles each instruction actually uses a 
functional unit. All C6x instructions have 1 functional unit latency, meaning that each functional 
unit is ready to execute the next instruction after 1 clock cycle regardless of the delay slots of the 
instructions. Therefore, the following instructions are valid: 


1 LDW D1 *A10, A4 
2 ADD D1 A1,A2,A3 


Although the first LDW instruction do not load the A4 register correctly while the ADD is executed, 
the D1 functional unit becomes available in the clock cycle right after the one in which LDW is 
executed. 


To clarify the execution of instructions with delay slots, let's think of the following example of the 
LDwW instruction. Let's assume ALO = 0X0100A2=1, and your intent is loading A9 with the 32- 
bit word at the address 0X0104. The 3 MV instructions are not related to the LDW instruction. They 
do something else. 


1 LDW D1 *A10++[A2], AQ 
2 MV .L1 A10, A8 

3 MV .L1 = At, AO 

4 MV .L1 Ad, A2 

5 


We can ask several interesting questions at this point: 


1. What is the value loaded to A8? That is, in which clock cycle, the address pointer is updated? 

2. Can we load the address offset register A2 before the LDW instruction finishes the actual 
loading? 

3. Is it legal to load to A10 before the first LDW finishes loading the memory content to A9? That 
is, can we change the address pointer before the 4 delay slots elapse? 


Here are the answers: 


1. Although it takes an extra 4 clock cycles for the LDW instruction to load the memory content 
to AQ, the address pointer and offset registers (A10 and A2) are read and updated in the clock 
cycle the LDW instruction is issued. Therefore, in line 2, A8 is loaded with the updated A10, 
that is A10 = A8 = 0x104. 

2. Because the LDW reads the A10 and A2 registers in the first clock cycle, you are free to 
change these registers and do not affect the operation of the first LDW. 

3. This was already answered above. 


Similar theory holds for MPY and B (when using a register as a branch address) instructions. The 
MPY reads in the source values in the first clock cycle and loads the multiplication result after the 
2nd clock cycle. For B, the address pointer is read in the first clock cycle, and the actual branching 
occurs after the 5th clock cycle. Thus, after the first clock cycle, you are free to modify the source 
or the address pointer registers. For more details, refer Table 3-5 in the instruction set description or 
read the description of the individual instruction. 


Addition, Subtraction and Multiplication 


There are several instructions for addition, subtraction and multiplication on the C6x CPU. The 
basic instructions are ADD, SUB, and MPY. ADD and SUB have 0 delay slots (meaning the results of 
the operation are immediately available), but the MPY has 1 delay slot (the result of the 
multiplication is valid after an additional 1 clock cycle). 

Exercise: 


Problem: 


(Add, subtract, and multiply): Write an assembly program to compute ( 0000 ef35h + 
0000 33dch - 0000 1234h ) * 0000 0007h 


Branching and conditional operations 


Often you need to control the flow of the program execution by branching to another block of code. 
The B instruction does the job in the C6x CPU. The address of the branch can be specified either by 
displacement or stored in a register to be used by the B instruction. The B instruction has 5 delay 
slots, meaning that the actual branch occurs in the 5th clock cycle after the instruction is executed. 


In many cases, depending on the result of previous operations, you execute the branch instruction 
conditionally. For example, to implement a loop, you decrement the loop counter by 1 each time 
you run a set of instructions and whenever the loop counter is not zero, you need to branch to the 
beginning of the code block to iterate the loop operations. In the C6x CPU, this conditional 
branching is implemented using the conditional operations. Although B may be the instruction 
implemented using conditional operations most often, all instructions in C6x can be conditional. 


Conditional instructions are represented in code by using square brackets, [ ], surrounding the 
condition register name. For example, the following B instruction is executed only if BO is nonzero: 


1 [BO] B .L1 AO 


To execute an instruction conditionally when the condition register is zero, we use ! in front of the 
register. For example, the B instruction is executed when BO is zero. 


1 [!BO] B .L1 = AO 


Not all registers can be used as the condition registers. In the C62x and C67x devices, the registers 
that can be tested in conditional operations are BO, B1, B2, A1, A2. 
Exercise: 
Problem: 
(Simple loop): Write an assembly program computing the summation peat n by 
implementing a simple loop. 


Logical operations and bit manipulation 


The logical operations and bit manipulations are accomplished by the AND, OR, XOR, CLR, SET, 
SHL, and SHR instructions. 


Other assembly instructions 


Other useful instructions include IDLE and compare instructions such as CMPEQetc. 


C62x instruction set summary 


The set of instructions that can be performed in each functional unit is as follows (See [link], [link], 
[link] and [link]). Please refer to TMS320C62x/C67x CPU and Instruction Set Reference Guide for 
detailed description of each instruction. 


Instruction Description 

ADD(U) signed or unsigned integer addition without saturation 
ADDK integer addition using signed 16-bit constant 

ADD2 two 16-bit integer adds on upper and lower register halves 
B branch using a register 

CLR clear a bit field 

EXT extract and sign-extend a bit field 


MV move from register to register 


Instruction Description 


MVC move between the control file and the register file 
MVK move a 16-bit constant into a register and sign extend 
MVKH move 16-bit constant into the upper bits of a register 
NEG negate (pseudo-operation) 

NOT bitwise NOT 

OR bitwise OR 

SET set a bit field 

SHL arithmetic shift left 

SHR arithmetic shift right 

SSHL shift left with saturation 

SUB(U) signed or unsigned integer subtraction without saturation 


SUB2 two 16-bit integer integer subs on upper and lower register halves 


Instruction Description 


XOR exclusive OR 
ZERO zero a register (pseudo-operation) 
.S Unit 
Instruction Description 
ABS integer absolute value with saturation 
ADD(U) signed or unsigned integer addition without saturation 
AND bitwise AND 
CMPEQ integer compare for equality 
CMPGT(U) signed or unsigned integer compare for greater than 
CMPLT(U) signed or unsigned integer compare for less than 
LMBD leftmost bit detection 


MV move from register to register 


Instruction 


NEG 


NORM 


NOT 


+OR 


SADD 


SAT 


SSUB 


SUBC 


XOR 


ZERO 


.L Unit 


Description 


negate (pseudo-operation) 


normalize integer 


bitwise NOT 


bitwise OR 


integer addition with saturation to result size 


saturate a 40-bit integer to a 32-bit integer 


integer subtraction with saturation to result size 


conditional integer subtraction and shift - used for division 


exclusive OR 


zero a register (pseudo-operation) 


Instruction 


ADD(U) 


ADDAB 
(B/H/W) 


LDB (B/H/W) 


MV 


STB (B/H/W) 


SUB(U) 


SUBAB 
(B/H/W) 


ZERO 


.D Unit 


Instruction 


MPY (U/US/SU) 


Description 


signed or unsigned integer addition without saturation 


integer addition using addressing mode 


load from memory with a 15-bit constant offset 


move from register to register 


store to memory with a register offset or 5-bit unsigned constant 


offset 


signed or unsigned integer subtraction without saturation 


integer subtraction using addressing mode 


zero a register (pseudo-operation) 


Description 


signed or unsigned integer multiply 16lsb*16lsb 


Instruction Description 


MPYH (U/US/SU) signed or unsigned integer multiply 16msb*16msb 
MPYLH signed or unsigned integer multiply 16lsb*16msb 
MPYHL signed or unsigned integer multiply 16msb*16lsb 
SMPY (HL/LH/H) integer multiply with left shift and saturation 

.M Unit 


Useful assembler directives 
Other than the CPU instruction set, there are special commands to the assembler that direct the 
assembler to do various jobs when assembling the code. There are useful assembler directivesyou 


can use to let the assembler know various settings, suchas .Sset, .macro, .endm, .ref, 
-align, .word, .byte .include. 


The . Set directive defines a symbolic name. For example, you can have 


1 count ,set 40 


The assembler replaces each occurrence of count with 40. 


The . ref directive is used to declare symbolic names defined in another file. It is similar to the 
extern declaration in C. 


The . Space directive reserves a memory space with specified number of bytes. For example, you 
can have 


1 buffer . space 128 


to define a buffer of size 128 bytes. The symbol buffer has the address of the first byte reserved 
by .Space. The . bes directive is similar to . space, but the label has the address of the last 
byte reserved. 


To put a constant value in the memory, you can use .byte, .word, etc. If you have 


1 consti .word 0x1234 


the assembler places the word constant 0X1234 at a memory location and const1 has the 
address of the memory location. . byteetc. works similarly. 


Sometimes you need to place your data or code at specific memory address boundaries such as 
word, halfword, etc. You can use the . align directive to do this. For example, if you have 


1 .align 4 
2 buffer . Space 128 
3 si 


the first address of the reserved 128 bytes is at the word boundary in memory, that is the 2 LSBs of 
the address (in binary) are 0. Similarly, for half-word alignment, you should have . align 
directive to do this. For example, if you have 


1 .align 2 
2 buffer . space 128 
3 en 


the . include directive is used to read the source lines from another file. The instruction 


1 . include ~*other.asm'' 


will input the lines in other . asm at this location. This is useful when working with multiple 
files. Instead of making a project having multiple files, you can simply include these different files 
in one file. 


How do you write comments in your assembly program? Anything that follows ; is considered a 
comment and ignored by the assembler. For example, 


1 ; this is a comment 
2 ADD iD A1, A2,A3 ;add ail and a2 


Assigning functional units 


Each instruction has particular functional units that can execute it. Note that some instructions can 
be executed by several different functional units. 


The following figure shows how data and addresses can be transfered between the registers, 
functional units and the external memory. If you observe carefully, the destination path (marked as 
dst) going out of the .L1, .S1, .M1 and D1 units are connected to the register file A. 


Note: This means that any instruction with one of the A registers as destination (the result of 
operation is stored in one of A registers) should be executed in one of these 4 functional units. 


For the same reason, if the instructions have B registers as destination, the .L2, .S2, .M2 and 
D2 units should be used. 


LD1 32 MSB 
sti 


Register 
file A 
Data path A (AO-A15) 


LD1 32 LSB 


DA1 


DA2 


LD2 32 LSB 


Register 
file B 
(BO-B15) 


Data path B 


LD2 32 MSB 
ST2¢ 


Control 
register 
file 


TMS320C67x DSP Block Diagram taken from SPRU733: 
TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide 


Therefore if you know the instruction and the destination register, you should be able to assign the 
functional unit to it. 
Exercise: 


Problem: 
(Functional units): List all the functional units you can assign to each of these instructions: 


1.ADD .?? AO,A1, A2 

2. Baars 

3.MVKL .?? 900023feh, BO 
4.LDW .?? *A10, A3 


If you look at the figure again, each functional unit must receive one of the source data from the 
corresponding register file. For example, look at the following assembly instruction: 


1 ADD .L1 AO, BO, A1 


The .L1 unit gets data from AO (this is natural) and BO (this is not) and stores the result in A1 (this 
is a must). The data path through which the content of BO is conveyed to the . L1 unit is called 
1Xcross path. When this happens, we add x to the functional unit to designate the cross path: 


1 ADD Aix AO, BO, A1 


Similarly the data path from register fileB tothe .M2, .S2 and .L2 units are called 2X cross 
path. 
Exercise: 


Problem: 
(Cross path): List all the functional units that can be assigned to each of the instruction: 


1.ADD .??? BO,A1,B2 
2.MPY .??? A1,B2,A4 


In fact, when you write an assembly program, you can omit the functional unit assignment 
altogether. The assembler figures out the available functional units and properly assigns them. 
However, manually assigned functional units help you to figure out where the actual execution 
takes place and how the data move around between register files and functional units. This is 
particularly useful when you put multiple instructions in parallel. We will learn about the parallel 
instructions later on. 


Writing the inner product program 


Now you should know enough about C6x assembly to implement the inner product algorithm to 
compute 


10 
Y= S Aan XLy 
n=1 
Exercise: 


(Inner product): Write the complete inner product assembly program to compute 


10 
Y= ) An X Ln 
n=1 


Problem: where a,, and z,, take the following values: 


al] 
x[] 


© © 
NO 


~ 


a 
ww 


The a,, and z,, values must be stored in memory and the inner product is computed by reading 
the memory contents. 


Pipeline, Delay slots and Parallel instructions 


When an instruction is executed, it takes several steps, which are fetching, decoding, and 
execution. If these steps are done one at a time for each instruction, the CPU resources are not fully 
utilized. To increase the throughput, CPUs are designed to be pipelined, meaning that the foregoing 
steps are carried out at the same time. 


On the C6x processor, the instruction fetch consists of 4 phases; generate fetch address (F1), send 
address to memory (F2), wait for data (F3), and read opcode from memory (F4). Decoding consists 
of 2 phases; dispatching to functional units (D1) and decoding (D2). The execution step may 
consist of up to 6 phases (E1 to E6) depending on the instructions. For example, the multiply (MPY) 
instructions has 1 delay resulting in 2 execution phases. Similarly, load (LDX) and branch (B) 
instructions have 4 and 5 delays respectively. 


When the outcome of an instruction is used by the next instruction, an appropriate number of NOPs 
(no operation or delay) must be added after multiply (one NOP), load (four NOPs, or NOP 4), and 
branch (five NOPs, or NOP 5) instructions in order to allow the pipeline to operate properly. 
Otherwise, before the outcome of the current instruction is available (which is to be used by the 
next instruction), the next instructions are executed by the pipeline, generating undesired results. 
The following code is an example of pipelined code with NOPs inserted: 


1 MVK 40, A2 
2 loop: LDH *A5++,A0 


3 LDH *A6++, AL 
4 NOP 4 

5 MPY AO, A1, A3 
6 NOP 

7 ADD A3,A4,A4 
8 SUB A2,1,A2 
9 [A2] B loop 
10 NOP 5 
11 STH A4, *A7 


In line 4, we need 4 NOPs because the A11 is loaded by the LDH instruction in line 3 with 4 delays. 
After 4 delays, the value of A1 is available to be used in the MPY AQ, A1,A3 in line 5. Similarly, 
we need 5 delays after the [A2] B loop instruction in line 9 to prevent the execution of STH 
A4, *A7 before branching occurs. 


The C6x Very Large Instruction Word (VLIW) architecture, several instructions are captured and 
processed simultaneously. This is referred to as a Fetch Packet (FP). This Fetch Packet allows C6x 
to fetch eight instructions simultaneously from on-chip memory. Among the 8 instructions fetched 
at the same time, multiple of them can be executed at the same time if they do not use same CPU 
resources at the same time. Because the CPU has 8 separate functional units, maximum 8 
instructions can be executed in parallel, although the type of parallel instructions are limited 
because they must not conflict each other in using CPU resources. In assembly listing, parallel 
instructions are indicated by double pipe symbols (| |). When writing assembly code, by designing 
code to maximize parallel execution of instructions (through proper functional unit assignments, 
etc.) the execution cycle of the code can be reduced. 


Parallel instructions and constraints 


We have seen that C62x CPU has 8 functional units. Each assembly instruction is executed in one 
of these 8 functional units, and it takes exactly one clock cycle for the execution. Then, while one 
instruction is being executed in one of the functional units, what are other 7 functional units doing? 
Can other functional units execute other instructions at the same time? 


The answer is YES. Thus, the CPU can execute maximum 8 instructions in each clock cycle. The 
instructions executed in the same clock cycle are called parallel instructions. Then, what 
instructions can be executed in parallel? A short answer is: as far as the parallel instructions do not 
use the same resource of the CPU, they can be put in parallel. For example, the following two 
instructions do not use the same CPU resource and they can be executed in parallel. 


1 ADD .L1 AO, A1, A2 
2 | | ADD L2 BO, B1, B2 


Resource constraints 


Then, what are the constraints on the parallel instructions? Let's look at the resource constraints in 
more detail. 


Functional unit constraints 


This is simple. Each functional unit can execute only one instruction per each clock cycle. In other 
words, instructions using the same functional unit cannot be put in parallel. 


Cross paths constraints 


If you look at the data path diagram of the C62x CPU, there exists only one cross path from B 
register file to the L1, M1 and S1 functional units. This means the cross path can be used only once 
per each clock cycle. Thus, the following parallel instructions are invalid because the 1x cross path 
is used for both instructions. 


1 ADD .L1x AO, B1, A2 
aa || MPY .M1x A5,BO,A3 


The same rule holds for the 2x cross path from the A register file to the L2, M2 and S2 functional 
units. 


Loads and Stores constraints 


The D units are used for load and store instructions. If you examine the C62x data path diagram, the 
addresses for load/store can be obtained from either A or B side using the multiplexers connecting 
crisscross to generate the addresses DA1 and DA2. Thus, the instructions such as 


1 LDW .D2 *BO, Al 


is valid. The functional unit must be on the same side as the address source register (address 
index in BO and therefore D2 above), because D1 and D2 units must receive the addresses from A 
and B sides, respectively. 


Another constraint is that while loading a register in one register file from memory, you cannot 
simultaneously store a register in the same register file to memory. For example, the following 
parallel instructions are invalid: 


LDW D1 *AO, AL 
|| STW .D2 A2, *BO 


NR 


Constraints on register reads 


You cannot have more than four reads from the same register in each clock cycle. Thus, the 
following is invalid: 


1 ADD Ld Ai, A1, A2 
> ii MPY M1. Ai, A1, A3 
3 «|| SUB D1 Ai, A4, A5 


Constraints on register writes 


A register cannot be written to more than once in a single clock cycle. However, note that the 
actual writing to registers may not occur in the same clock cycle during which the instruction is 
executed. For example, the MPY instruction writes to the destination register in the next clock 
cycle. Thus, the following is valid: 


ADD Riel Al, Al, A2 
|| MPY M1 Al, Al, A2 


NR 


The following two instructions (not parallel) are invalid (why?): 


1 MPY M1 Al, Al, A2 
2 ADD hil A3, A4, A2 


Some of these write conflicts are very hard to detect and not detected by the assembler. Extra 
caution should be exercised with the instructions having nonzero delay slots. 


Ad-Hoc software pipelining 


At this point, you might have wondered why the C6x CPU allows parallel instructions and generate 
so much headache with the resource constraints, especially with the instructions with delay slots. 
And, why not just make the MPY instruction take 2 clock cycles to execute so that we can always 
use the multiplied result after issuing it? 


The reason is that by executing instructions in parallel, we can reduce the total execution time of 
the program. A well-written assembly program executes as many instructions as possible in each 
clock cycle to implement the desired algorithm. 


The reason for allowing delay slots is that although it takes 2 clock cycles for an MPY instruction 
generate the result, we can execute another instruction while waiting for the result. This way, you 
can reduce the clock cycles wasted while waiting for the result from slow instructions, thus 
increasing the overall execution speed. 


However, how can we put instructions in parallel? Although there's a systematic way of doing it 
(we will learn a bit later), at this point you can try to restructure your assembly code to execute as 
many instructions as possible in parallel. And, you should try to execute other instructions in the 
delay slots of those instructions such as MPY, LDW, etc., instead of inserting NOPs to wait the 
instructions produce the results. 


Code Composer Studio v4 Assembly Project 
This module describes the process of creating, building and debugging a 
Code Composer Studio (CCS) v4 assembly program project. 


Introduction 


This module describes how to create a Code Composer Studio (CCS) v4 
project that executes a simple assembly program. The module does not 
explain the code in the project files. The project does not use DSP/BIOS 
and runs on the TI simulator. In this project the processor used is the 
TMS320C67xx but the process should work for other processors as well. 


Create a CCS project 


To create a CCS project select File->New->CCS Project. 
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Screenshot of CCS Project menu 


This will bring up a window where you can enter the name of the project. 
The location selected is the default location for project files. Press Next. 


CCS Project 
Create a new CCS Project. 


Project name: ASMproject 


Use default location 


Location: C:/Users/david.waldo/Documents/workspace/ASMproject 


CCS Project name and location 


Since the example uses the TMS320C67xx processor the project type 
selected is C6000. The project configurations are Debug and Release. 
Select the Next button. 


¥¢ New CCS Project 
Select a type of project 
Select the platform and configurations you wish to deploy on 


peje pe (HO 


Configurations: 


(V] #5 Debug 
[¥] #3 Release 


("| Show All Project Types 
[-] Show All Configurations 


Type of CCS project 


If there are any project dependencies they are selected on the next screen. 
Select the Next button. 


Additional Project Settings 


Define the inter-project dependencies, if any. 


S Projects 
Referenced Projects 


© eb asmex 

© eS asmtest 
 &clklab 

1) &$ helloworld2 
be BS tsklab 
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CCS Project dependencies 


On the Project Settings Screen, select the items that pertain to the type of 
project to be created. Since the project will be executed select Output 
Type: Executable. The processor type is TMS320C67xx so the Device 
Variant is Generic C67xx Device. This project will use Little Endian. 
The code generation tools are the latest version (in this case TI v7.0.3). The 
runtime support library is selected to match the device variant and the 
endianness selected. The library can be selected automatically. Press Next. 


Since this project will have assembly programs only, select the Empty 
Assembly-only Project. Select Finish and the project will show up in the 
C/C++ Projects view. 


Project Settings 


Select the project settings. 
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Runtime Support Library: <automatic> ¥ 
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Project settings window 


we 2 Project 2 


Project Templates 
Select one of the available project templates. 


e- & Empty Projects Creates an empty assembly-only project 

a Empty Project fully initialized for the selected device. 
Empty Assembly-only Project 

_S Empty RTSC Project 

fe Basic Examples 
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Project Templates 
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C/C++ projects view 


Add code files 


There are three files that will be added to the project. One, main.asm, will 
contain the assembly program we want to execute. The file vectors.asm 
will contain the code that gets executed when the reset interrupt occurs. The 
file Link. cmd will be the linker command file that tells the linker where 
section of memory are located and where to put the code in the memory 
map. 


Main program 


Create the main.asm file that conains the assembly program to execute. 
Select File->New->Source File. 
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Close Ctrl+V 
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New source file 


The New Source File dialog opens and you can enter the source file name. 
Add a .asm extension for an assembly program file. The source folder 
should be the folder of your project. Select Finish and a new file is opened. 


Create a new source file. 


Source Folder: ASMproject 


Source File: main.asm| 


New source file dialog 


Enter the assembly code shown below in main.asm. 


; Example assembly file 

; Multiply two numbers 

.text ; Indicates that what follows is code 

.def _c_int00; This symbol is defined in this 
file 

; but used in another file 

_c_int0O: MVK .S1 0x34,A1 ; Put a number in 
register Ail 

MVK .S1 0x25,A2 ; Put a number in register A2 

MPY .M1 A1,A2,A1 ; Multiply two numbers in A1 
and A2 

NOP 

NOP 

NOP 

NOP 

NOP 


Reset vector file 


When a reset interrupt occurs in a C6000 processor the system will start 
executing the code located at the interrupt memory location. There are only 
8 instructions at each interrupt vector. If a long program is to be executed 
then the interrupt vector needs to branch to the program location. This is 
done by using a resect vector program. 


Create a new file named vectors.asm and enter the following code. 


.ref _c_int00O ; name of the label where 

; the program is located 

.sect"vectors" ; this will put this code 

; in the vectors memeory section 

reset: mvkl.s2 _c_int00,b0O ; put the address 
in bo 

mvkh.s2_c_int00, bO 

b .S2 bO ; branch to the program 

nop 

nop 

nop 

nop 

nop 

nop 


Command file 


The linker command file tells the linker information about the memory map 
and where to put parts of the code like data and program instructions. 
Create a new file named Link .cmd and enter the following code. 


MEMORY 
{ 
VECS : origin = 0x00000000 length = 0x00000220 
IPRAM : Origin = 0x00000240 length = 


Ox0000FDCO 


} 
SECTIONS 


{ 
vectors > VECS 
.text > IPRAM 


} 


Create a target configuration 


In order to build and run your project you need to have a target 
configuration. This file will contain information about the emulator that will 
be used and the target processor. In this module the target configuration will 
be the TI simulator for the C6713 processor. First select File->New- 
>Target Configuration File. 
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Open new target configuration file 


In the New Target Configuration dialog enter the configuration file name 
ending in the .ccxml file extension. Put the file in the project directory. 
Select Finish and a new target configuration file opens. 


In the configuration, select the connection needed (simulator or some type 


of emulator, etc.) and the type of device. In this example the TI simulator is 
used with the C67xx little endian simulator device. Select Save when done. 
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‘General Setup 
This section describes the general configuration about the target. 
Connection | Texas Instruments Simulator _ 
Device 67 2 


C6713 Device Cycle Accurate Simulator, Big Endian 
[-] ©6713 Device Cycle Accurate Simulator, Little Endian 
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[| ©6747 Device Cycle Accurate Simulator, Little Endian 
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Now the target configuration is in the project. 
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Project files 


Debug the project 


To start a debug session either select Target->Debug Active Project or 
press the debug icon in the menu bar. 


When the debug session starts the code files are compiled and linked into an 
executable file and the file is loaded onto the target. The reset vector code 
should be located at memroy location 0x0. 


vectors.asm £4 | 8) linkicmd =" \ml 
b -global main ; name of the label where a 
2 7 the program is located 
3 -sect "vectors" ; this will put this cod 
4 7; in the vectors memeory se 
@| 5_reset: mvkl .s2 main,bO ; put the address is | 
6 mvkh +82 main,boO 
7 b .s2 bO ; branch to the program 
8 nop 
9 nop 
i090 nop 
ii nop 
12 nop 
iB | nop 
14 a 
< m r 
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MVK.S2 0x0240,B0 
MVKH.S2 0x0000,B0 
B.S2 BO 

NOP 

NOP 

NOP 

NOP 

NOP 

NOP 

NOP 

NOP 

NOP 


=F Disassembly (S../vectors.asm:5:14$) £ | 
Enter location here 
reset, $../vectors.asm:5:14$: 

» 0x00000000: 0001202A 
0x00000004: oo000006A 
ox00000008: 00000362 
0x0000000C: 00000000 
0x00000010: 00000000 
0x00000014: 00000000 
0x00000018: 00000000 
0x0000001C: 00000000 
0x00000020: ooo00000 
0x00000024: 00000000 
0x00000028: 00000000 
0x0000002C: 00000000 
0x00000030: 00000000 


NOP 


Debugging session 


If your window is small, CCS may not have room to open all windows. In 
this case your debugging window may look like the following figure. 
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Crowded debug window 


Note that the window where the program files are located may not show 
much (circled in red). In this case you will need to expand the window. 


Begin executing the code by stepping through the code using the Assembly 
Step Into button. 
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Assembly 
Step Into 
button 


The code should execute until the branch instruction finishes executing and 
then it should jump to the main function located at a different memory 
location. 


[S} main.asm 23 | fp) link.emd = 6 || es Disassembly (S../main.asm:6:14$) 53 | 
1; a assembly file a Enter location here ¥ | es Wh | 2. ® | es) if 
27 Ly two numbers im ‘text, text, main, $../main.asm:6:14$: 
3 text # Indicates that what f 0x00000240:  00801A28 MVK.S1 0x0034,A1 
4 -def main * This symbol is define 0x00000244:  010012~8 MVK.S1 0x0025,A2 
5 * but used in another f oxoo000248:  ooBs2ceo MPY.ML A1,A2,A1 
>| 6ma Put a number in regis Oxoo00024c:  o0000000 NOP 
7 + Put a number in regis ox00000250: 00000000 NOP 
8 MPY .Mi A1,A2,A1 ; Multiply two numbers ox00000254: 00000000 NOP 
9 NOP oxo0000258: 00000000 NOP 
10 NOP oxo000025C: 090000000 NOP 
ii NOP etext, etext: 
12 NOP 0x00000260: 00000000 NOP 
a pu 0x00000264: 00000000 NOP 
Pe oxo0000268:  o0000000 NOP 
+ oxoo00026C: 00000000 NOP 
«| i r < 


Debugging session after jump to main function 


Creating a TI Code Composer Studio Simulator Project 

This module describes how to create a TI Code Composer Studio v3.3 
project and set it up for use with the simulator. This project uses the C6713 
Device Cycle Accurate Simulator but the general process can be used for 
other types of projects. The project is also set up to use DSP/BIOS with the 
configuration tool. 


Introduction 


Become familiar with how to create a Code Composer Studio v3.3 project 
using the simulator and DSP/BIOS. 


Reading 


e SPRU509: Code Composer Studio Getting Started Guide 


Description 


Before creating a project in Code Composer Studio (CCS), you will need to 
setup CCS for the board or processor you are going to be using. This 
module will describe how to make a project using the C6713 simulator. 


e Run Setup Code Composer Studio. On the right select the C6713 
Device Cycle Accurate Simulator and add it to the left. This is the 
configuration that will be used when CCS is started. Save and quit. 
starts. 
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FB Factory Boards [Ba Custom Boards | Custom Boards | Create Board | 
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Drag a device driver to the left to add a board to the system. y 


CCS Setup for the C6713 Device Cycle Accurate Simulator. 


e Create a new project in CCS by selecting Project->New... This will 
bring up a window that will look like Figure 2. Type the name of your 
project in the Project Name field and it will add a folder with the same 
name in the Location field. Leave everything else the same. Click 
Finish and CCS will open your new project. This will look like Figure 
3. 


Project Name: jest 
Location: _[C.\CCStudio_v3.3\MyProjects\test\ ey 


Project Type: [Executable (out) 7) 
Target [TMS320CE7XX 7] 
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Project creation window 
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Project window 
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After the project is created, make sure the correct processor is chosen 
in the build option. Select Project->Build Options... and select the 
processor that applies to the the configuration you set up in the Code 
Composer Setup tool. Here the C671x is selected to match the C6713 
Device Cycle Accurate Simulator. 
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Processor selection in Build Options 


In order to build a project you must have files in it. To add files you 
must first create the files and save them. Create a new assembly code 
file (*.asm), C code file (*.c) or command file (*.cmd) by selecting 
File->New->Source File. When you save the file give it the correct 
extension. Add the files to the project by selecting Project->Add Files 
To Project... and selecting the files. You can see the files in your 
project by expanding the "+" signs. All projects will need source files 
and a command file. 

For projects that use DSP/BIOS, create anew DSP/BIOS 
Configuration file for the hardware you are using. Select File->New- 


>DSP/BIOS Configuration ... If you are using the C6713 simulator 
then select the ti.platforms.sim67xx template. 

e If you use the template there may be an error in the default settings. If 
there is no heep set up in any memory bank this must be done first. 
Click on System->MEM and check the Segment for DSP/BIOS 
Objects and Segment for malloc()/free(). If they say MEM_NULL you 
need to set up a memory heep. Go to the SDRAM under MEM, select 
the properties and check Create a heep in this memory. The size should 
be 0x00008000. Then go back to System->MEM and change the 
MEM_NULLs to SDRAM. 

e Save the file as configuration file test.tcf and add it to your project. 
Also add the command file test.cmd file to your project. 

e When developing your main C program for your project it should start 
with the following code example. Note that the testcfg.h file is 
created when the test.tcf file is compiled. Save the following 
code in main. c and add it to your project. 


#include <std.h> 
#include "testcfg.h" 


main(){ 
// Startup code goes here 
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Project with tcf and cmd files added 


Set up the project for building your assembly programs by selecting 
Project->Build Options, clicking on the Linker tab and selecting No 
Autoinitialization for the Autoinit Model. When you build your C 
programs you will want to select Load-time Initialization. 

Once you have all your files in your project you can build the project 
to produce the object file (*.out). Select Project->Build to 
incrementally build the project or Project->Build All to rebuild all 
project files. The Project->Build will only perform functions that are 
needed to bring the project up to date. This option will usually be 
quicker than the Project->Build All option. 

Now that the project has been built you must load the object file onto 
the target board or simulator. Do this by selecting File->Load Program 


and selecting the *.out file. The file may be located in a Debug 
subdirectory. 

With the project loaded you can step through the code (Debug->Step 
Into), view the registers (View->CPU Registers->Core Registers) and 
debug the results. There are buttons on toolbars that you should 
become familiar with to aid in the debugging process. 

Since you will be often times loading a program immediately after 
building it you can set up an option so that the load will occur after 
you build your project. Select Option->Customize... and click on the 
Program/Project/CIO tab. Make sure the Load Program After Build 
box is checked. 


Assembly Programming Lab 
This is the lab assignment portion for the Assembly Programming lab. 


Introduction 


The purpose of this lab is to help you become familiar with the internal 
CPU of the 'C6713 processor. Programs are written in assembly. The first 
program is very simple so that you can make sure you know how to write 
an assembly program and debug it. The other two programs are a little more 
involved and help you understand the multiply and sum algorithm, delay 
slots and parallel instructions. 


Part 1 


e Make a new project titled lab1 using the 'C6713 Simulator. 

e Write a program in assembly that multiplies the two constants 0x35 
and 0x10. You will want to review the MVK and MPY instructions. MVK 
is used to put the constants in a register and the MPY is used to 
multiply them. Save this in a file called Lab1p1.asm. 

e Setup the program in Code Composer Studio and run it, recording the 
values in the registers at each step as you single step through the 
program. Be sure to note delay slots. 


Part 2 


e Write an assembly language program that performs a dot product of 
the following vectors using a loop that incorporates a conditional 
branch (The data for a and b should be stored in memory). 


a=[(123456789 1011 12 13 1415 16 17 18 19 20 21 22 23 24 25 26 
27 28 29 30) 


b = [30 29 28 27 26 25 24 23 22 21 2019 1817 1615 1413121110987 
654321] 
Equation: 


y= > ax: by 
k 


Use the ADD command to add the data, SUB to update a counter and a 
conditional B to perform the loop. 


e Save the program as lab1p2.asm and swap it for the other file in 
your project. 

¢ Record the results of the run. Also record the number of clock cycles 
to execute the program. You do not need to optimize your program in 
this part. 


Part 3 


e Write a third program that performs the same operation as in the 
previous part but uses parallel instructions and the register file cross 
path. With the use of parallel instructions and the use of both register 
files the second program should take fewer instruction cycles. Be sure 
to use a loop in your optimized program. 

e Save the program as 1ab1p3.asm and swap it for the other file in 
your project. 

¢ Record the results of the run. Also record the number of clock cycles 
to execute the program. 


Code Composer Studio v4 C Project 
Create a Code Composer Studio v4 C project. 


Introduction 


This module describes how to create a Code Composer Studio (CCS) 
project that executes a simple C program. The project does not use 
DSP/BIOS and runs on the CCS simulator. The code will be a simple "hello 
world" program. In this project the processor used is the TMS320C67xx but 
the process works for other processors as well. 


Create a CCS project 


To create a CCS project select File->New->CCS Project. 


8 C/C++ - Code Composer Studio (Licensed) 

Edit View Navigate Project Target Tools Scripts Window Help 
New File... Standard Make Project 
Open File... RTSC Configuration Project 
Clete Ctrl + C3 Folder 

Close All Ctrl + Shift +4 Gt Source File 

in’ Header File 


trl +Shift+S 


| File 
@ Class 
@ Target Configuration File 


234 DSP/BIOS v5.xx Configuration File 


Pj Other... 


Screenshot of CCS Project menu 


This will bring up a window where you can enter the name of the project. 
The location selected is the default location for project files. Press Next. 


CCS Project 
Create a new CCS Project. 


Project name: helloworld 


Use default location 
Location: C:/Users/david.waldo/Documents/workspace/helloworld | Browse... 


CCS Project name and location 


Since the example uses the TMS320C67xx processor the project type 
selected is C6000. The project configurations are Debug and Release. 
Select the Next button. 


¥¢ New CCS Project 
Select a type of project 
Select the platform and configurations you wish to deploy on 


peje pe (HO 


Configurations: 


(V] #5 Debug 
[¥] #3 Release 


("| Show All Project Types 
[-] Show All Configurations 


Type of CCS project 


If there are any project dependencies they are selected on the next screen. 
Select the Next button. 


Additional Project Settings 
Define the inter-project dependencies, if any. 


& Projects 
Referenced Projects 

tS asmex 

i asmtest 

i clklab 

©] ES helloworld2 

be BS tsklab 


CCS Project dependencies 


On the Project Settings Screen, select the items that pertain to the type of 
project to be created. Since the project will be executed select Output 
Type: Executable. The processor type is TMS320C67xx so the Device 
Variant is Generic C67xx Device. This project will use Little Endian. 
The code generation tools are the latest version (in this case TI v7.0.3). The 
runtime support library is selected to match the device variant and the 
endianness selected. The library can be selected automatically. Press Next. 


ery 
CCS Pro: a9 


Project Settings 
Select the project settings. 


Output Format: legacy COFF v 
Unter Command ile 
Runtime Support Library: <automatic> ¥ 


CCS project settings 


Since this project will be a simple C program project, select either the 
Empty Project or the Hello World example. 


Project Templates 
Select one of the available project templates. 


{=} Empty Projects Simple Hello World executable application + 
5} Empty Project printing the string "Hello World!" to 
[jg Empty Assembly-only Project standard output. 
fg Empty RTSC Project 
=f} Basic Examples 
fj} Hello World 
(=) DSP/BIOS v5.20 Examples 


(H-{| IPC and I/O Examples 


After the projects settings have been set, select Finish and the project will 
show up in the C/C++ Projects view. 


(-fe> asmtest 


cS clklab 


wes helloworld [Active - Debug] 


ates helloworld2 
ofS tsklab 


C/C++ projects view 


Add code files 


Projects in CCS must have a main function. This function gets executed 
first by default. So create a program file that contains a main function. 
Select File->New->Source File. 


C/C++ - Code Composer Studio (Licensed) 


Edit View Navigate Project Target Tools Scripts Window Help 


New File... 
Open File... 
Close Ctrl+V 
Close All Ctrl +Shift+WV 
zz Save Ctrl+$ 
Save As... 
Sy Save ll Ctrl+Shift+$ 
Revert 
Move 
Rename... F2 


CCS Project 

Standard Make Project 

RTSC Configuration Project 
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ih? Header File 

[* File 

@ Class 
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Fy Other... 


New source file 


The New Source File dialog opens and you can enter the source file name. 
Add a.c extension for a C program file. The source folder should be the 
folder of your project. Select Finish and a new file is opened. 


Create a new source file. 


Source Folder: helloworld 


Source File: main.c 


New source file dialog 


Enter the C code. The file must contain a main function. After entering the 
code, save the file. 


i #include <stdio.h> 

2 void main (void) || 

3 printf ("hello world\n"); 
4 

5 


return; 


>| 


Code in main.c 


Create a target configuration 


In order to build and run your project you need to have a target 
configuration. This file will contain information about the emulator that will 
be used and the target processor. In this module the target configuration will 
be the TI simulator for the C6713 processor. 


First select File->New->Target Configuration File. 


3 C/C++ - main.c - Code Composer Studio (Licensed) 
Edit View Navigate Project Target Tools Scripts Window Help 


CCS Project 
New File... Standard Make Project 
Open File... RTSC Configuration Project 
Close Ctrl+W C3 Folder 
Close All Ctrl+Shift+w | [¢/ Source File 
; —= ih’ Header File 
& fda . Ctrl+S c¢ File 
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Open new target configuration file 


In the New Target Configuration dialog enter the configuration file name 
ending in the .ccxml file extension. Put the file in the project directory. 
Select Finish and a new target configuration file opens. 


«+ New Target Configuration 


Target Configuration 
Create a new Target Configuration file. 


File name: €6713_simulatorccxm| 


|] Use shared location 


Location: /helloworld 


Target configuration dialog 


In the configuration, select the connection needed (simulator or some type 
of emulator, etc.) and the type of device. In this example the TI simulator is 
used with the C6713 little endian simulator device. Select Save when done. 


Basic 
r 


» 


‘General Setup “Advanced Setup 
This section describes the general configuration about the target. | 


Connection Texas Instruments Simulator 


Device type filter text 


C64xx CPU Cycle Accurate Compiled Simulator, Little E « 
C64xx CPU Cycle Accurate Simulator, Big Endian 
C64xx CPU Cycle Accurate Simulator, Little Endian 
6713 Device Cycle Accurate Simulator, Big Endian 
6713 Device Cycle Accurate Simulator, Little Endian 
C672x CPU Cycle Accurate Simulator, Little Endian Za 
6745 Device Cycle Accurate Simulator, Little Endian 
C6745 Device Functional Simulator, Little Endian 
C6747 Device Cycle Accurate Simulator, Little Endian 
(6747 Device Functional Simulator, Little Endian ~ 
um i r 


mm, 


IBEBESESEaggg 


~ 


Simulates C67x core, LID, LIP, L2 Caches, EDMA, QDMA,  « 
Timer(2), EMIF interfacing with async & SDRAM memory {a 
models, McBSP(2), McASP(2) and interrupt selector. Does 

not model HPI and 2C. 

This is a cycle accurate device configuration and hence = + 


Note: Support for more devices may be available from the update manager. 
«| mI , r 
f T T 


Target configuration setup 


Now the target configuration is in the project. 


wie asmex 
t-te asmtest 
ce GeS clklab 
[-{ee helloworld [Active - Debug] 
ae Includes 

—— -[e) main.c 


a4 ©6713_simulator.ccxml [Active] 
H)-te$ helloworld2 
GS tsklab 


Project files 


Debug the project 


The default project settings do not have the stack and heap sizes defined. 
This will produce warnings when building the project. To change the 
project properties, select Project->Properties and then click on C/C++ 
Build on the left and the C6000 Linker/Basic Options under Tool 
Settings. For the stack size and heap size enter 0x400 and click OK. 


typefiltertet | | C/C++ Build Aas 


Info 


Builders | Active configuration 
C/C++ Build | Project Type: | C6000 = 
C/C++ Documentatior : i 
C/C++ File Types SEE 9 y| | Manage... 
C/C++ Indexer : : 
Configuration Setti 
CCS Build 2 = 
CCS Debug Tool Settings | Build Settings | Build Steps | Error Parsers | Binary Parser | Environment | Macros 
Project References - - . : 
Refactoring History 3 Basic Settings: Specify output file name (--output_file,-0) "helloworld.out” Browse... 
§} C6000 Compiler 
8 C6000 Linker Input and output sections listed into <file> (--map_file,-m) "helloworld.map" a 

& Basic Options: Set C system stack size (--stack_size, -stack) 0x400 

e& Command File Preprocessing: 

& Diagnostics: Heap size for C/C++ dynamic memory allocation (--heap_size, -heap) 0x400| 

(3 File Search Path: 

3 Linker Output: 

m lanagement: 
Symbol Manag 

(8 Runtime Environment: 

(8 Linktime Optimization: 

(8 Miscellaneous: 


Setting the stack and heap sizes 


To start a debug session either select Target->Debug Active Project or 
press the debug icon in the menu bar. 


Eh 


Debu 
g icon 


When the debug session starts the code files are compiled and linked into an 
executable file and the file is loaded onto the target. The initialization code 
is executed and the program is halted at the beginning of the main 
function. 


[@) main.c 2 | 6713_simulator.ccxml 


1 #include <stdio.h> 
Sy SS esc 


ain(vo 


printf ("hello world\n"); 
return; 


nm wb WwW 


Debugging session 


Icons on the debug window can be used to run the program and other 
debugging tools. 


% ey i Gey |RaAWwWwAM| > wer S 


Debugging tools 


Press the green arrow to run the program. The main function will be 
executed and the program will leave the main function. In the program the 
printf function prints to the console which if not shown by default can be 
viewed by selecting View->Console. In the console is shown the "hello 
world" that was printed. 


EI Console 23 | Properties | 
helloworld [Project Debug Session] C6713 Device Cycle Accurate Simulator, Little Endian_0/TMS32006713: CIO (3:46:16 PM) 
hello world 


Console output of the program 


C Programming Lab 
This is the lab assignment portion for the C Programming lab. 


Skills 


This laboratory deals with compiling and running a C program on the 
TMS320C6713 DSP using Code Composer Studio. By the end of the 
laboratory you will be able to: 


¢ Compile and build a C language program 
e Runa C language program on the TMS320C6713 DSP 
e Link multiple files with functions into one project 


Reading 


e SPRU 187: Optimizing C Compiler 
¢ C programming reference card 


Description 


If you have forgotten how to program in C/C++ you should review C 
programming on the many tutorials on the internet or look through your 
favorite textbook or reference book. This document will not give you a 
tutorial on how to program in C/C++. 


You must remember that when programming the TI DSP you are loading 
your program into a system that has its own DSP and memory and is apart 
from your computer. In order for you to print anything, the data must be 
sent from the DSP board to CCS and then displayed in the stdio display. 
This can be very slow so you will want to use the debugging tools to view 
variables. 


In this lab you will be writing very simple C programs. If you want to print 
something you will obviously need to include the stdio.h file. 


Your main program should look like: 


#include <stdio.h> 
main( ) 


if 


variables 
code 


} 


Be sure to use good programming style. 


Pre-Laboratory 


Write a C program that performs a dot product of the following vectors. The 
data for a and b should be stored in integer arrays. Use a for loop. 


a=[(12345678910 11 12 13 1415 16 17 18 19 20 21 22 23 24 25 26 
27 28 29 30] 


b = [30 29 28 27 26 25 24 23 22 21 2019 1817 16151413121110987 
654321] 
Equation: 


y= > ay: by 
k 


Laboratory 


Part 1 


e In this part you will create a project that performs the array multiply 
and sum within the main function of the program. In order to be able to 
accurately count clock cycles, use the C6713 simulator. 

e Create a new project, call it clab. 

e Create your C file. Call it main.c. Add this file to your project. 


¢ You should use the following options for the project (Project->Build 
Options): 

e After the project has been loaded, record the values in the variables at 
each step as you single step through the program. Count the number of 
clock cycles and compare to the assembly program completed in a 
previous lab. Be sure to zero the counter before stepping through your 
code. 


Part 2 


e In this part you are going to write a function that performs an array 
multiply and sum algorithm and returns the sum. This function will be 
located in another file and you will also make a header file. 

e Create aC file called mul tsum.c. In this file make a function called 
multsum. The function should have as inputs: 


e pointer to integer array a 

e pointer to integer array b 

e number of elements in the arrays 

e The output of the function should be the integer sum. The function 
should perform the multiply and sum algorithm. 


e Create a header file for the multsum function and call it muLtsum.h. 
In the header include the following code: 


#1ifndef —_MULTSUM_H_ 

#define _MULTSUM_H_ 

extern int multsum(int *a, int *b, int length); 
#endif /*MULTSUM_H_*/ 


This uses some compiler directives to determine if the current header file 
has already been included by another include file. The macro 
_MULTSUM_H_ is not defined the first time it is encountered so the macro 
is defined and the definition is included. The extern keyword tells the 
compiler that the function is defined in another file. 


e Inmain.c: 
e include the muLtsum.h file 


#include "multsum.h" 


e declare the a and b arrays 
call the muLtsum function with the sum returned to a variable 


¢ Include the multsum.c file in your project. 
e Run the program and verify that the sum that is returned is correct. 


Circular Buffers 
Circular buffers are used for first-in-first-out (FIFO) buffers. 


Introduction 


When performing DSP algorithms data needs to be stored as it is being 
input. Only a certain amount of data is kept. New data is saved and the old 
data is removed. This is a first-in-first-out (FIFO) buffer. There are two 
ways to implenent the FIFO buffer. One is to shift the data and the other is 
to implement a circular buffer. 


Shifted buffer 


There are two ways of implementing a FIFO buffer. Both methods start 
with defining an array that is the length of data needed to be stored. The 
first method of implementing a FIFO buffer is to add the new data to the 
beginning of the buffer and shifting the old data in the buffer. The last value 
that is shifted out is discarded or used before it is removed. 


New Old 
of 1} 2] 3] 4] 5/ 


To implement the FIFO buffer by shifting the following pseudo-code can be 
used. 


N - number of coefficients 
x(n) - stored input data, n = 0...N-1 


input_samp1e - variable that contains the newest input sample 


// Shift in the new sample. Note the data is 
shifted starting at the end. 

for (1=N-2; 1>=0; i--){ 

x[itt] = x[i]; 


x[0] = input_sample 


Circular buffer 


The second method is the more preferable method since it takes less 
processing. In this method a circular buffer is implemented. An index keeps 
track of the current point in the buffer. The data in the buffer does not get 
shifted, the index gets incremented and wrapped around when it gets to the 
end of the buffer. The element after the current index is the last value in the 
circular buffer. The following figure shows the circular buffer after the 
index has already wrapped around. 


index 


In the following figure is an example of a FIFO circular buffer with the 
number elements equal to 7. In the figure the current index is 3 and that is 
where the new sample is stored. The last element in the buffer is in element 
4. 


0 1 2 3 4 > 6 


| index 


FIFO buffer implemented using a circular buffer 


When a new value is received, the index is incremented (and possibly 
wrapped around to zero), the old value where the index points can be saved 
in a temporary variable to be used later and the new value is written where 
the index points thus deleting the old value. After another increment the 
new representation of the data will look like the following figure. 


0 ak 2 3 4 5 6 


| index 


FIFO buffer implemented using a circular buffer after increment 


In terms of signal processing notation, the value at the index 4 is x(n). Since 
the value at the index 3 came in one time sample before the index 4, it 
corresponds to x(n-1). Similarly, the value at index 2 is x(n-2). 


The following code shows pseudo-code that can be used to implement a 
circular buffer. 


N - number of coefficients 

x(n) - stored input data, n = 0...N-1 

input_samp1e - variable that contains the newest input sample 
index - location in the buffer where the current sample is stored 
index = (index+1)%N; // use the remainder to wrap 


around the index 
X(index) = input_sample; 


Single Sample and Block I/O 
Data can be input to a signal processing algorithm either a single sample at 
a time or a block at a time. 


Introduction 


Signal processing applications that run in "real-time" input data, process the 
data, and then output the data. This can be done either on a single sample 
basis or a block of data at a time. 


Single Sample I/O 


A simple way to perform signal processing algorithms is to get one input 
sample from the A/D converter, process the sample in the DSP algorithm, 
then output that sample to the D/A converter. When I/O is done this way 
there is a lot of overhead processing to input and output that one sample. As 
an example, suppose we will process one sample at a time with the 
following equation 

Equation: 


where is the input and is the output. Here is how the processing 
takes place. When an interrupt occurs, systems usually input a sample and 
output a sample at the same time. Suppose we have a value for the output, 
y(0), in memory before the interrupt occurs. 


Then when the interrupt occurs, the y(0) is output and x(0) is input. 


Interrupt 


Interrupt 


Notice that the processing of the input sample must be done before the next 
interrupt occurs or else we won't have any data to output. Also notice that 
the output is always one sample delayed from the input. This means that the 
current input cannot be instantly output but must wait for the next interrupt 
to occur to be output. 


Block I/O 


In order to minimize the amount of overhead caused by processing one 
sample at a time in-between each interrupt, the data can be buffered and 
then processed in a block of data. This is called "block processing." This is 
the method that will be used in this module and the Code Composer Studio 
(CCS) project. The benefit of block processing is it reduces overhead in the 
I/O processes. The drawback to block processing is that it introduces a 
larger delay in the output since a block of data must be input before any 
data is processed and then output. The length of the block determines the 
minimum delay between input and output. 


Suppose we are processing the data with the same algorithm as above: 
Equation: 


Also, suppose we are going to process the data with a block size of 3. This 
means we need to start with 3 values in our output buffer before we get 
more data. 


Three interrupts occur and we get 3 new values and output the three values 
that are in the output buffer. 


Then it starts all over again. Notice how far down x(0) is moved. This is the 
delay amount from the input to the output. 


Code Composer Studio v3.3 DSP/BIOS and C6713 DSK 
This module describes the TI C6713 DSK hardware and how to set up a 
Code Composer Studio v3.3 DSP/BIOS project for the DSK. 


Introduction 


This module describes the TMWS320C6713 DSK development board and 
how to use it in a Code Composer Studio (CCS) v3.3 project that uses 
DSP/BIOS. An example project is included. 


Reading 


e TMS320C6713 DSK Technical Reference 

e SLWS106D: TLV320AIC23 Technical Manual 

e SPRA677: A DSP/BIOS AIC23 Codec Device Driver for the 
TMS320C6713 DSK 

e SPRU616: DSP/BIOS Driver Developer's Guide 

e SPRA846: A DSP/BIOS EDMA McBSP Device Driver for 
TMS320C6x1x DSPs 


Project Files 


The files referred to in this module can be found in this ZIP file: 
DSK6713 audio.zip 


DSK Hardware 


The following figure shows the block diagram of the TMS320C6713 DSK 
hardware. The heart of the DSK is the TMS320C6713 DSP chip which runs 
at 225 MHz. The DSP is in the center of the block diagram and connects to 
external memory through the EMIF interface. There are several devices 
connected to this interface. One device is a 16 Mbyte SDRAM chip. This 
memory, along with the internal DSP memory, will be where code and data 
are stored. 


On the DSK board there is a TLV320AIC23 (AIC23) 16-bit stereo audio 
CODEC (coder/decoder). The chip has a mono microphone input, stereo 
line input, stereo line output and stereo headphone output. These outputs are 
accessible on the DSK board. The AIC23 figure shows a simplified block 
diagram of the AIC23 and its interfaces. The CODEC interfaces to the DSP 
through its McBSP serial interface. The CODEC is a 16-bit device and will 
be set up to deliver 16-bit signed 2's complement samples packed into a 32- 
bit word. Each 32-bit word will contain a sample from the left and right 
channel in that order. The data range is from - Q0-1) 46 ( Dite- 1) 7) or 
-32768 to 32767. 
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TMS320C613 DSK Block Diagram taken from TMS320C6713 DSK 
Technical Reference 
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Simplified AIC23 CODEC Interface taken from TMS320C6713 DSK 
Technical Reference 


DSK6713 Audio Project Framework 


The following figure shows a diagram of the software that will be used in 
this module. Texas Instruments has written some drivers for the McBSP 
that get data from the AIC23 and write data to the AIC23. The input data is 
put into an input stream (input buffer) and the output data is read from an 
output stream (output buffer). The signal processing software simply needs 
to get a buffer from the input stream, process the data, then put the resulting 
buffer in the output stream. 


The project is set up using DSP/BIOS, a real time operating systems 
developed by TI. This module does not explain how to use DSP/BIOS but it 
will explain what objects are used in this project. The main objects are an 
input stream, inStream, an output stream, outStream, and a task, 
TSK_processing, which uses the function processing( ). 


Input 
Stream 


AIC23 Signal 
Processing 


Output 
Stream 


Block diagram of the signal processing framework. 


There are a few things that can be easily modified by the user in the main 
program file DSK6713_audio.c. 


CODEC Setup 


The CODEC can be set up for different inputs and different sample rates. In 
the program there is a variable that sets up the registers in the CODEC: 


DSK6713_EDMA_AIC23 DevParams AIC23CodecConfig ={ 
OxOOO00AB01, //VersionIid 
0x00000001, //cacheCalls 
0x00000008, //irgId 
AIC23_REGO_DEFAULT, 
AIC23_REG1_DEFAULT, 
AIC23_REG2_ DEFAULT, 


AIC23_REG3_DEFAULT, 

AIC23_REG4_LINE,// Use the macro for Mic or 
Line here 

AIC23_REG5_DEFAULT, 

AIC23_REG6_DEFAULT, 

AIC23_REG/7_DEFAULT, 

AIC23_REG8_48KHZ,// Set the sample rate here 

AIC23_REG9_DEFAULT, 

0x00000001, //intrMask 

0x00000001 //edmaPriority 


+; 


This is set up to use the Line In as the input with the macro 
AIC23_REG4_LINE. If the microphone input is needed then change this 
macro to AIC23_REG4_MIC. The sample rate is set to 48kHz with the 
macro AIC23_REG8_48KHZ. This sample rate can be changed by 
changing the line to one of the following macros: 


AIC23_REG8_8KHZ 
AIC23_REG8_32KHZ 
AIC23_REG8_44_1KHZ 
AIC23_REG8_48KHZ 
AIC23_REG8_96KHZ 


Signal Processing Function 


The main part of the software is an infinite loop inside the function 
processing( ). The loop within the function is shown here. 


while(1) { 

/* Reclaim full buffer from the input stream 
= 7 

if ((nmadus = SIO_reclaim(&inStream, (Ptr 
*)&inbuf, NULL)) < 0) { 


SYS_abort( "Error reclaiming full buffer from 
the input stream"); 


} 


/* Reclaim empty buffer from the output stream 
to be reused */ 

if (SIO_reclaim(&outStream, (Ptr *)&outbuf, 
NULL) < 0) { 

SYS_abort("Error reclaiming empty buffer from 
the output stream"); 


} 


// Even numbered elements are the left channel 
(Silver on splitter) 

// and odd elements are the right channel 
(Gold on splitter). 

/* This simply moves each channel from the 
input to the output. */ 

/* To perform signal processing, put code here 
*/ 

for (1 = 0; 1 < CHANLEN; itt) { 

outbuf[2*i] = inbuf[2*1];// Left channel 
(Silver on splitter) 

outbuf[2*it+1] = inbuf[2*1+1];// Right channel 
(Gold on splitter ) 


/* Issue full buffer to the output stream */ 

if (SIO_issue(&outStream, outbuf, nmadus, 
NULL) != SYS_OK) { 

SYS_abort("Error issuing full buffer to the 
output stream"); 


/* Issue an empty buffer to the input stream 
wa 
if (SIO_issue(&inStream, inbuf, 


SIO_bufsize(&inStream), NULL) != SYS_OK) { 
SYS_abort("Error issuing empty buffer to the 
input stream"); 
Z 
} 


This loop simply gets buffers to be processed with the SIO_reclaim 
commands, processes them, and puts them back with the SIO_issue 
commands. In this example the data is simply copied from the input buffer 
to the output buffer. Since the data comes in as packed left and right 
channels, the even numbered samples are the left channel and the odd 
numbered samples are the right channel. So the command that copies the 
left channel is: 


outbuf[2*i] = inbuf[2*1i];//7 Left channel (Silver 
on splitter ) 


Suppose you wanted to change the function so that the algorithm just 
amplifies the left input by a factor of 2. Then the program would simply be 
changed to: 


outbuf[2*i] = 2*inbuf[2*1i];/7/ Left channel (Silver 
on splitter ) 


Notice that there is a for loop within the infinite loop. The loop is 
executed CHANLEN times. In the example code, CHANLEN is 256. So the 
block of data is 256 samples from the left channel and 256 samples from 
the right channel. When writing your programs use the variable CHANLEN 
to determine how much data to process. 


Buffer Sizes 


In the program file is a macro that is used to define the size of the buffers 
that are in the inStream and outStream buffers. The macro is 


#define CHANLEN 256 // Number of samples per 
channel 


The CCS project uses TI's device drivers described in SPRA846. In the TI 
document SPRA846 it states that "If buffers are placed in external memory 
for use with this device driver they should be aligned to a 128 bytes 
boundary. In addition the buffers should be of a size multiple of 128 bytes 
as well for the cache to work optimally." If the SIO streams LnStream and 
outStream are in external memory then the buffer sizes need to adhere to 
this requirement. If the streams are placed in internal memory then they can 
be any size desired as long as there is enough memory. 


Suppose you want to keep the buffers in external memory but change the 
size. As an example, you want to change the length of each channel to 512 
(multiple of 128). Then change CHANLEN to 512 and then open the 
configuration file DSK6713_audio.tcf and examine the properties for 
inStream. 
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DSP/BIOS SIO object inStream properties 


Notice that the buffer size is 1024. This was for a channel length of 256. In 
the C6713 a Minimum Addressable Data Unit (MADU) is an 8-bit byte. 
The buffer size is in MADUs. Since each sample from the codec is 16-bits 
(2 MADUs) and there is a right channel and a left channel, the channel 
length is multiplied by 4 to get the total buffer size needed in MADUs. 
256*4 = 1024. If you want to change the channel length to 512 then the 
buffer size needs to be 512*4 = 2048. Simply change the buffer size entry 
to 2048. Leave the alignment on 128. The same change must be made to the 


outStream object so that the buffers have the same size so change the 
buffer size in the properties for the outStream object also. 


Suppose you have an algorithm that processes a block of data at a time and 
the buffer size is not a multiple of 128. To make the processing easier, you 
could make the stream buffer sizes the size you need for your algorithm. In 
this case, the stream objects must be placed in internal memory. As an 
example suppose you want a channel length of 94. Set the CHANLEN macro 
to 94 and then open the configuration file DSK6713_audio.tcf and 
examine and modify the properties for LnSt ream and outStream. The 
buffer size will now be 94*4 = 376 and the buffer memory segment must be 
change from external RAM (SDRAM) to the internal RAM (IRAM). 
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Chip Support Library 
CCS v3.3 comes with the chip support library (CSL). If for some reason 


you don't have it or it needs to be updated, download it from the TI website 
and install it. 


Project Setup 


Make an empty DSP/BIOS project 


To create a CCS project select Project->New.... This will bring up a 
window where you can enter the name of the project. The location selected 
is the default location for project files. Note that it is in the install directory 
for CCS v3.3. Select the Target type as TMS320C67xx. Press Finish. 


Project Creation 


Project Name: |DSKE71 3audio 
Location: Ic: \CCStudio_v3.3\MyProjects\DSK67 el 
Project Type: Executable (.out) ¥ | 


Target TMS32006 AX 


Cancel | Help | 


CCS Project name and location 


Copy files from the zip file to project directory. When they are copied to the 
project directory the files need to be added to the project. Right click on the 
project name and select Add Files To Project.... There should be 4 files 
added to the project as well as a header file and a couple of automatically 
generated files. 
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Project view after adding files 


Set up the CSL 


Some options in the project need to be changed in order to use the CSL. To 
get to the project properties select Project->Build Options. Under 
Preprocessor enter the chip that we are using, in this case CHIP_6713. 


Build Options for DSK671 3audio. pjt (Debug) 
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Preprocessor setting for CSL 


The library file for the CSL also needs to be added to the project. Right 
click on the project or where it says Libraries and search for the CSL library 
file named CS16713.11b. It should be located in the directory 

C NCCStudio ve: 3\GGo00\cs i lib. 


Add Command File 


The configuration file, DSK6713_audio.tcf, generates several files, 
one of them being a command file titled DSK6713_audio.cmd. Add this 
file to the project. If it is not there, you may need to attempt to build the 
project so that the file gets generated. 


Using CCS with the DSK 


Startup CCS with DSK connected to PC 


After installing CCS v3.3 the C6713 DSK drivers need to be installed also. 
If they are not already installed then download them from Spectrum Digital, 
manufacturer of the DSK, or install them from the install disk that comes 
with the DSK. 


Add power to the DSK board and then plug the USB connector into the PC 
you are using. Start up CCS. As the splash screen comes, under Windows 
XP, there should be a little window that pops up in the bottom right corner 
of the screen. 


DSK STARTUP 


Waiting for USB Enumeration. 


Popup window when CCS 
starts and DSK is connected 


Connect to target 


CCS v3.3 must first be connected to the target before the executable can be 
loaded onto the board. Select Debug->Connect. This will cause CCS to 
connect to the board and the board status should be in the lower left corner 
of the CCS window. 


GEL StartUp Complete. 


The target is now connected 


ae eae ee GEL Output / 
— 


Target status 


Debug /test the project 


In order to debug and run the program you need to build it and then load it 
onto the DSK board. To build the project, click the Incremental Build 
button. 


Incrementa 
1 Build 
button 


After the project builds successfully, load the . out file to the DSK board 
but selecting File->Load Program... and selecting the file. Or, if you want 
it loaded automatically, select Options->Customize and then on the 
Project/Program/CIO tab select the checkmark next to Load Program 
After Build. When you load the program you will see a status window like 
the following figure. 


Loading Program 


Loading Program 


B ... SNDSK671 3audio\D ebug\DSK671 3audio. out 


Loading sections. 


BERRRREEEe 


ctext: O of 20192 at Ox8000d0e0 


Load Program status window 


Once the program is loaded it is ready to be used. The following figure 
shows the final project with the main program file. 
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Final project view 


Echo Lab 
This module is a laboratory assignment to implement an echo function. 


Introduction 


In this laboratory you will use a program that gathers data from an A/D 
converter, processes the data, and outputs to a D/A converter. The signal 
processing portion will be added by you to perform an echo function. After 
this laboratory you should: 


e Understand how data is received from an A/D converter and sent to a 
D/A converter 

e Understand how signal processing is done in "real-time" 

e Be able to implement a simple signal processing algorithm using 
"block processing" techniques 


Laboratory Assignment 


In the laboratory you will be making a delay function or an echo. To make 
an echo you will need to delay the data that is coming into the DSP. This is 
usually accomplished using a buffer that is used to store previous input 
values. Use a circular buffer to store the input data. The length of the buffer 
is determined by 

Equation: 


N = Delay - SampleRate 


In order to implement the delay an array of sufficient size must be defined. 
When a function is called, the local variables will go onto the stack. The 
stack is usually small compared to other memory sections. Therefore, the 
number of local variables for a function is limited. Since the array for 
storing the previous inputs will be large, the array should be stored 
somewhere else in memory besides on the stack. If the Memory Model in 
the build options is set to Far Aggregate then the global variables and 
the local static variables will be put into the . far section. This section can 
be set to the large external memory. Therefore, the array that stores the 


delay values should be defined as either a global or static local variable. A 
static local variable is a variable that is not initialized each time a function 
is called. The static variable will retain the value it had the last time the 
function was called. To make a static variable just use the static keyword in 
front of the variable definition. 


static int static_var, static_array[10]; 


You need to make sure the . far section is set to a large memory section. 
For our project, set it to the SDRAM memory. To change the location of the 
. far section in the project you should: 


Open the project DSK6713_audio.pjt 

Open the configuration file DSK6713_audio. tcf 

Right click on System->MEM and open Properties 

Click on the Compiler Sections tab 

On the pull down menu next to the . far section select SDRAM 


Pre-Laboratory 


If your sample rate is 8000 Hz and you want a delay of 1/2 second, 
how long should your delay buffer be? 

The main loop in the default I/O program simply inputs a block of data 
and then outputs a block of data. You need to modify the main loop so 
that it will take an input sample, store it in the circular buffer and 
output the sample that is delayed by 1/2 second. Your code should 
implement a circular buffer. Also, make sure you initialize the buffer 
with zeros so that you won't output whatever happens to be in the 
memory. 

What is the transfer function, H(z), of the system in the following 
figure? What is the difference equation? What would happen if you set 
the feedback gain to a value greater than 1? 

Write another program that will implement a fading echo shown in 
following figure. The output of the delay is fed back to the input of the 
delay and added to the input. 


x(n 
” Delay 1/2 
second 


Laboratory 


Part 1 


e In this part you will just get the template project working that simply 
outputs whatever was input. 

e Open the project DSK6713_audio.pjt. 

e Change the parameters in the DSK6713_audio.c file so that the 
sampling rate is 8000 Hz and the input is from the microphone. 

e Connect a microphone to the DSK board microphone input and the 
speakers to the line out. 

e Build, load and run the project. 

e Verify that as you speak into the microphone you hear yourself on the 
speakers. 


Part 2 


e Modify the program DSK6713_audio.C so that one channel of 
output is delayed by 1/2 second. Make sure to define your stored 
samples array as a global or static array in external memory. 

e Build, load and run the project. 

e Demonstrate that you get a 1/2 second delay in the output. 


Part 3 


Modify the program DSK6713_audio.c to implement the fading 
echo. 

Build, load and run the project. 

Demonstrate that you get a 1/2 second delay in the output. 

What would happen if you set the feedback gain to a value greater that 
1? 


Second Order System Impulse Response Generation 
This module describes the recursive generation of the impulse response of a 
second order system. 


Introduction 


This module examines the recursive generation of the impulse response of a 
second order system. This analysis can be used to determine how to generate 
a cosine or sine function recursively. 


Second Order Systems 


The transfer function of a general second order system is given as: 
Equation: 


H( ) by z? + b1z+ bo 
2)2>-- oO 
z27+aiz+az 


The Direct Form II structure for the implementation of the second order 
system is shown in the following figure. 


2 v7) 


The difference equation for the second order system is 
Equation: 


y(n) + ary(n — 1) + agy(n — 2) = box(n) + bi a(n — 1) + box(n — 2) 
Equation: 

y(n) = —ayy(n — 1) — agy(n — 2) = box(n) + bya(n — 1) + boa(n — 2) 
To recursively determine the impulse response of the system start with the 


following: 
Equation: 


Then using Equation 3 the impulse response can be found recursively by: 
Equation: 


h(0) = ayh( 1) ayh(—2) + by d(0) + b16(—1) + b2d(—2) = bo 


Equation: 
h(1) = —a,h(0) — agh(—1) + bod(1) + 616(0) + b26(—1) 
Equation: 
h(1) = —ay1bo + by 
Equation: 


h(2) = —a,h(1) — a2h(0) + bp d(2) + b,0(1) + b2d(0) 


Equation: 


h(2) = ar( a; +t bi) agbo + bo 


For the values of n>2 the recursive equation reduces to 
Equation: 


h(n) = —a,h(n — 1) — agh(n — 2) 


because the values of «(n),a(n — 1) and x(n — 2) will all be zero for n>2. 


So, if you know the values of h(0) and h(1), then the Equation 10 can be 
used to find future values of h(n). 


Cosine Generator Lab 
This module is the recursive cosine generation lab. 


Skills 


In this laboratory you write a program that generates a cosine on the output 
of the D/A converter. After this laboratory you should: 


e Be able to work with transfer functions and difference equations. 
e Be able to get a Visual Basic program to communicate with your DSP 
program on the DSK board. 


Reading 


e Chapter 4: Emmanuel Ifeachor and Barrie W. Jervis, Digital Signal 
Processing: A Practical Approach, Pearson Education Limited, 
England, 2002. 

e Chapter 13 & 14: Mrinal Mandal, Amir Asif, Continuous and Discrete 
Time Signals and Systems, Cambridge University Press, 2007. 


Description 


In this lab you will be making a cosine function generator. This will require 
you to generate the values that go to the D/A converter in order to produce a 
cosine function. Since we are only concerned with the output, the input 
values will be ignored. You will need to determine the values of the cosine 
function necessary to produce it on the output. 


Pre-Laboratory 


e In order to generate a cosine function we will generate the impulse 
response of a system that has the impulse response 


Equation: 


h(n) = cos(an)u(n) 


QJ 


For the given impulse response find the following: 


. Transfer function of the system. Look it up in a ZT table. 
. Direct Form II structure of the system. 
. Pole/zero diagram of the system. Assume a = 1/4 for the example 


diagram. 


. Difference equation of the system. 


Calculate h(0), h(1), and h(2) assuming the input is an impulse x(n) = 
6(n). Simplify the h(2) term. 

Show how you will recursively determine the impulse response of this 
system. 

Suppose the sample rate of the system is 8000 Hz and we want to 
generate a cosine with a frequency of 500 Hz, what should a be to 
implement this cosine function? 

In MATLAB, write a program that will recursively determine the 
impulse response of this system with sample rate 8000Hz and cosine 
of 400 Hz. Put the data into an array and plot the impulse response. 
Generate 100 points. Compare the recursively generated values to the 
direct calculation of the cosine function. 

Modify the while loop Write a program in C that will recursively 
determine the impulse response of this system. 


. The recursive portion should be in an infinite while loop and the data 


should not go into an array but in a single variable. 


. All of the variables should be of type double. 
. In your program you will need to include the header file math. h. 
. The data we are generating is for the D/A converter. The D/A 


converter accepts Int16 type data which has the limits of [-32768, 
32767]. Since the impulse response has the limits [-1, 1] you will need 
to multiply the result by a number to scale the value up to a level that 
will work with the D/A converter without overflowing. Use the value 
32000 .0. The statement will be something like (Int16) 
(32000. 07 y )). 


. To calculate 7, use the following statement: 


pi=16.0*atan(1.0/5.0)-4*atan(1.0/239.0); 


Laboratory 


Part 1 


e Use the sample I/O program for the DSK and generate a 400 Hz cosine 
sampled at 8000 Hz using the recursive program you developed. 

e Change the frequency of the cosine to at least 2 other values from 
about 50 Hz to 4000 Hz and record the results. 

e What happens when you change the frequency of the cosine to 5000 
Hz? Explain what you see. 


Part 2 


e In this part you will be creating a Visual Basic (VB) application that 
communicates with a target application through the Real Time Data 
Exchange (RTDX). The VB application will have two buttons that 
when pressed will send different numbers to the target (DSP board). 
The target application will print the value received to a LOG object 
and turn the output cosine either on or off. 

e Create a new Visual Basic project (These may not be exactly the steps 
depending on your version of Visual Studio): 


. Start Visual Studio. 

. Select Create Project. 

. In the Project types: select Visual Basic: Windows. 

. Under Templates: select Windows Forms Application. 
. The Solution Name will be CosineOnOf f. 

. Click OK. 
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e Add two buttons to your form with the following characteristics: 


1. (Name): Button1 


2. Text: On 
3. (Name): Button2 
4. Text: Off 


e In the Form code in the Form1:(Declarations) section add code for the 
status constants and other variables. This will be right under the 
statement Public Class Formt. The code should be: 


Const Success &HO ' Method call is valid 

Const Failure &H80004005 ' Method call 
failed 

Const ENoDataAvailable = &H8003001E ' No data 
currently available. 

Const EEndOfLogFile = &H80030002 ' End of 
transmission 


Dim rtdx As Object ' Holds the rtdx object 

Dim bufferstate As Integer ' Holds the number 
of bytes xmitted 

" Or pending xmission 

Dim status As Integer ' RTDX Function call 
return status 


e The first function to be executed when a project is started is 
Form: Load. In this function put the code for initializing the RTDX 


channel. Name the channel HtoTchan for Host to Target channel. 


Try 
rtdx_out = CreateObject ("RTDX" ) 
status = rtdx.Open("HtoTchan", "W") ' Open 
channel for writing 
Catch ex As Exception 
"Exception occured when opening the RTDX 
object 


System.Diagnostics.Debug.WriteLine("Opening of 
channel HtoTchan failed") 
rtdx_out = Nothing 
End ' Force program termination 
End Try 


e One of the last functions to execute is the Form: FormCLlosing 
function. In this function put the code for cleaning things up. 


status = rtdx.Close() ' Close rtdx 
Set rtdx = Nothing ' Free memory reserved for 
rtdx obj 


e Inthe Button1:Click function put the code to be executed when 
the button is clicked. When the button is clicked it should send a 4 byte 
integer, the number 1, to the target indicating that button 1 was clicked. 
The code for doing this is: 


Dim Data As Integer 


Data = 1 
status = rtdx.Writel4(Data, bufferstate) 

If status = Success Then 

Debug.Print "Value " & Data & " was sent to 
the target" 


Else 
Debug.Print "WriteI4 failed" 
End If 


e Inthe Button2:Click function put the code to be executed when 
the button is clicked. When the button is clicked it should send a 4 byte 
integer, the number 0, to the target indicating that button 2 was clicked. 
The code is the same as above, the data is just 0. 

e Now the target code needs to be updated. 

e Save the DSK6713_audio.c file as DSK6713_audi02.c and 
use this as the main file in your project. 

e Include a header file so we can use the RTDX. 


#include <rtdx.h> 


e Create a global input channel called HtoTchan. This code will go in 
the global variables section. 


RTDX_CreateInputChannel( HtoTchan ); 


e Add code to the top of function processing: 


int data; 
int status; 
int busystatus=0; 


/* enable the Host to Target channel */ 
RTDX_enableInput( &HtoTchan ); 


e In the while loop add the code: 


/* check to see if the channel is busy before the 
read */ 
if (busystatus == 0) 


/* Print the data if something was actually 
read */ 

if (RTDX_sizeofiInput(&HtoTchan) == 
sizeof (data) ) 


// data received here 

LOG_printf(&trace,"Value sent = %d",data); 

} 

status = RTDX_readNB( &HtoTchan, &data, 
sizeof(data) ); 


/* get the status of the channel */ 
busystatus = RTDX_channelBusy(&HtoTchan) ; 


e The above code will read 4 byte values from the HtoTchan input 
channel. When a value is received the program will get to the 
statement // data received here and will have the value in 
the variable data. 

e Since the variable data will contain either a 1 or a 0 this can be used to 
multiply the cosine value to turn it on and off. (Int16) 
(32000.0*y*data); 

¢ Set up RTDX by opening the Configuration window with Tools- 
>RTDX->Configuration Control, right click to select properties and 
then click on Continuous mode. After closing the properties check the 
Enable RTDX box. 

e Run the target application. It is important that the target application be 
started before the VB application. If the VB application is started first 
the channel may not get initialized. 

e Run the VB application. 

e Verify that the tone that is output can be turned on and off with your 
VB program. 


FIR Filter Structures 


Consider causal FIR filters: y(n) = cone h(k)a(n — k); this can be realized using 
the following structure 


or in a different notation 
Zz! z! z! z! 
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ed 
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This is called the direct-form FIR filter structure. 


There are no closed loops (no feedback) in this structure, so it is called a non- 
recursive structure. Since any FIR filter can be implemented using the direct-form, 
non-recursive structure, it is always possible to implement an FIR filter non- 
recursively. However, it is also possible to implement an FIR filter recursively, and 
for some special sets of FIR filter coefficients this is much more efficient. 


Example: 


where 


But note that 
y(n) = y(n — 1) + a(n) — e(n — M) 
This can be implemented as 


x(n) 


Instead of costing MM — 1 adds/output point, this comb filter costs only two 
adds/output. 


Exercise: 
Problem: Is this stable, and if not, how can it be made so? 


IIR filters must be implemented with a recursive structure, since that's the only way a 
finite number of elements can generate an infinite-length impulse response in a linear, 
time-invariant (LTI) system. Recursive structures have the advantages of being able 
to implement IIR systems, and sometimes greater computational efficiency, but the 
disadvantages of possible instability, limit cycles, and other deletorious effects that 
we will study shortly. 


Transpose-form FIR filter structures 


The flow-graph-reversal theorem says that if one changes the directions of all the 
arrows, and inputs at the output and takes the output from the input of a reversed 
flow-graph, the new system has an identical input-output relationship to the original 


flow-graph. 
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Cascade structures 
The z-transform of an FIR filter can be factored into a cascade of short-length filters 
bop tbyz-1 + bgz FF +... + bz” = bo (1 — nie) (1 _ wae \ ahh (1 — tae} 


where the z; are the zeros of this polynomial. Since the coefficients of the polynomial 
are usually real, the roots are usually complex-conjugate pairs, so we generally 
combine (1 = oe) (1 _ ae) into one quadratic (length-2) section with real 
coefficients 


(1- we) (1 - zz) =1-2 (x%)z 1+ (|z|)°2? = Hi(z) 


The overall filter can then be implemented in a cascade structure. 


This is occasionally done in FIR filter implementation when one or more of the short- 
length filters can be implemented efficiently. 


Lattice Structure 


It is also possible to implement FIR filters in a lattice structure: this is sometimes 
used in adaptive filtering 


x(n) 


FIR Filter Implementation 
This module describes how to implement an FIR filter using a FIFO buffer 
where the elements are shifted and another that uses a circular buffer. 


Introduction 


This module explains how to implement an FIR filter in the Direct-Form 
structure. It explains how data is stored as samples are received and how to 
use a circular buffer in the algorithm. 


Reading 


e For a description of FIR Direct-Form structures see the module FIR 
Filter Structures. 
e For a description of circular buffers see Circular Buffers. 


FIR Filtering 


The difference equation for an N-order FIR filter is given by: 
Equation: 


N-1 
= ¥ bala - k) = S¢ h(k)a(n — k) 


k=0 


For each new input sample, x(n), the filter produces an output y(n). To 
calculate the new output sample, the input must be convolved with the FIR 
filter impulse response, h(n). Notice that the summation requires the input 
samples starting with the most recent sample, x(n) through the N-1 most 
recent samples, through x(n - N -1). Ina DSP system, these previous 
samples must be saved in an array in memory. Usually there will be one 
new sample received, x(n), and any previous samples that were received 
must be saved in memory. 


FIFO Buffer 


Since the algorithm only needs the previous N-1 samples to be saved there 
must be a way for the algorithm to update the array when a new sample is 
input. This can be accomplished in two ways. The first way is to shift all the 
samples in the array by copying them to the shifted location and then 
storing the new value at the beginning of the array. This is a first-in-first-out 
(FIFO) buffer. If this method is used then the FIR algorithm is given by 
Equation: 


N-1 
Yout = S° PUB) Peaycal kh) 
k=0 


Figure 1 shows a diagram of the implementation of the FIR filter using the 
array shifting method. In the diagram the "*" indicates multiplication. The 
new value is shifted into the x(0) spot and all the others are shifted to the 
right. 


x(n) x(n-1) x(n-2) x(n-(N-1)) 
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Diagram of the FIR filter implementation using the 
shifting method 


The following pseudo-code shows how to implement the FIR filter using 
the FIFO type buffer. In the code the following definitions are used: 


N - number of coefficients 

h(n) - filter coefficients, n = 0...N-1 

x(n) - stored input data, n = 0...N-1 

input_samp1e - variable that contains the newest input sample 


// Shift in the new sample. Note the data is 
shifted starting at the end. 

for (1=N-2; 1>=0; i--){ 

x[itt] = x[i]; 


x[0] = input_sample 

// Filter the data 

ynew = 0, 

for (1=0;1<N;1++){ 

ynew = ynew + h(1)*x(1); 


} 


Double Size FIFO Buffer 


In the example above, the data in the FIFO buffer is shifted every time a 
new sample is received. This could lead to inefficient code. It is better to do 
processing on blocks of data. For instance, you could use a DMA to copy a 
block of data from one place to another and free up the processor from 
doing the work. 


One solution could be to make a FIFO buffer that is twice as large as the 
data that is needed to store. When this is done, shifting of the data can be 
done after a block of data is received. 


The following figures show the data buffer as data is being received and 
stored. Figure 2 shows the buffer when the first new value is received. Of 
course the buffer would be zeroed before processing begins. The shaded 
area shows the part of the buffer that is used in the algorithm. 


Data buffer at the beginning 


Figure 3 shows the buffer after M samples have been received. Note that 
there is no need to shift data in the buffer. The shaded area is what is being 
used for the fitler. 


Data buffer after M values have been stored 


Figure 4 shows the buffer when the new sample is stored in the last element 
of the buffer. At this point the shifting of the data is needed to remove old 
data. The old data is at the beginning of the buffer. This data is just 
overwritten with the data at the end of the buffer. 


Data buffer when the index reaches the end of the buffer 


Figure 5 shows the buffer after the data has been copied. The algorithm will 
now use the data at the beginning of the buffer just as it did in figure 2. 
Notice that a large block of data is copied at one time. 


Data buffer showing the data that is copied from the end to the 
beginning 


The following pseudo-code shows how to implement the FIR filter using 
the double size FIFO type buffer. In the code the following definitions are 
used: 


N - number of coefficients 

h(n) - filter coefficients, n = 0...N-1 

x(n) - stored input data, n = 0...2N-1 

input_samp1e - variable that contains the newest input sample 


index - variable that contains the current place where the new sample is to 
be stored 


x[index] = input_sample 
// Filter the data 
ynew = 0; 
for (1=0;1<N;1++){ 
ynew = ynew + h(i)*x(index-1); 


} 


// update the index and copy the data if 
necessary 

index = index+1; 

if (index>=2*N) { 

for (1=N-2; 1>=0; i--){ 

X[1+N+1] = x[1]; 


index =N - 1; 


} 


Circular Buffer 


The second method for storing input data is use a circular buffer. To do this 
you will need to keep track of an index that tells where the last input was 
stored. This index is incremented and used to store the new value. When the 
index reaches the end of the array it is wrapped around to the beginning of 
the array. If this method is used then the FIR algorithm is 

Equation: 


N-1 
Yout = )_, h(k)a((index — k)) 
k=0 


where (+) is modulo-N. This assumes that the index is incremented as 
new data is stored in the buffer. The following figure shows how the data 
for x(n) is stored in a circular buffer. 


When implementing the FIR filter using a circular buffer, notice that the 
coefficient buffer indexes going forward and the data buffer indexes going 
backward. 

N - number of coefficients 


h(n) - filter coefficients, n = 0...N-1 


x(n) - stored input data, n = 0...2N-1 


input_samp1e - variable that contains the newest input sample 


index - variable that contains the current place where the new sample is to 
be stored 


X[index] = input_sample 
// Filter the data 
ynew = 0, 
for (1=0;1<N;1++){ 
if ((index-1)<0){ 
ynew = ynew + h[i]*x[index-i+N]; 
t 
else { 
ynew = ynew + h[i]*x[index-1i]; 
hs 
} 


index = (index+1)%N; 


Double Size Circular Buffer 


In the previous implementation of the circular buffer notice that there is an 
if statement in the filter loop. This is very time consuming. So to remove 
this from the loop a buffer that is twice the length can be used. The new 
value that is received is put in two places within the buffer. 


In the following figure shows a buffer for a circular buffer of size 4. 
Therefore the total length of the buffer is 8. The two indexes point to the 
places where the input sample is stored. In the figure the shaded boxes are 
the intems that make up the filter elements. Since the samples do not wrap 
around the end of the buffer there is no need to check the offset in the loop. 
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Double sized circular buffer with two indexes 


After the filter output is determine the indexes are incremented and checked 
to see if they go past the end of the buffer. The following figure shows the 
buffer and indexes after one increment. 


OO 
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Buffer after increment of indexes 


The following figure shows the indexes after another increment. Notice that 
the indexes wrap around and the element that are used still don't wrap 
around the end of the buffer. 
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Buffer after indexes wrap around the end 


To implement the double size circular buffer the following pseudo-code can 
be used. 


N - number of coefficients 


h(n) - filter coefficients, n = 0...N-1 


x(n) - stored input data, n = 0...2N-1 


input_samp1e - variable that contains the newest input sample 


index1 - the first place in x(n) where the new sample is to be stored 


index2 - the second place in x(n) where the new sample is to be stored 


// variable are 
values 


initialized to the following 


int indexi=0; 
int index2=N; 


// AS a new sample is received put it in two 


places 
x[index1] 
x[index2 | 
ynew = 0; 


for (1i=0;1<N; 


input_sample; 
input_sample; 


1++){ 


ynew = ynew + h[i]*x[index2-1]; 


} 


index1 
index2 


(index1+1)%N; 
indexi+N; 


FIR Filter Lab 
This module is the OC FIR filter lab. 


Introduction 


This module is a lab assignment for implementing an FIR filter using a 
circular buffer. 


Reading 


e For a description of FIR Direct-Form structures see the module FIR 
Filter Structures. 

e For a description of circular buffers see Circular Buffers. 

e FIR Filter Implementation. 


Filter Implementation 


In this lab you will be writing a function fir_filt that will take as its 
input one new sample and produce one output sample. The fir_filt 
function will have the following basic structure: 


Store the input value in a circular buffer 
Compute the new output value 

Return the new output value 

Store the coefficients in in a header file 
Here are some tips for writing the function. 


e Store the coefficients as float values. 

e Store the input as Int16 values (use a global variable or a local static 
variable). This is not very good programming practice but we will do it 
to simplify the processing. 


e The intermediate sum in the filter calculation should be stored in a 
float variable initialized to 0.0. 

e Be sure to cast the output f Loat variable to an Int 16 before output. 
eg.return (Int16) var; 

e When calling the function use a statement like outbuf[2*i] = 
eter eae (noite 2 i |) 


Pre-Laboratory 
e Design a Bandpass FIR filter with the following requirements: 


. Bandpass 

. Stopband attenuations: 40 dB 

. Passband: 1300Hz to 2000 Hz 

. Transition widths: 600 Hz 

. Sample rate: 8000 samples/sec. 

. DO NOT USE BUILT IN FUNCTIONS IN MATLAB for filter 
designs. You can use functions like window and sinc. 
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e For the filter design, your prelab report should include a plot of the 
filter coefficients and the magnitude and phase plots of the filter 
spectrum with non-normalized Hz as the x-axis. 

e Save the coefficients to an ASCII data file to be used in your C 
function below. It would be advantageous to have this completely 
automated so that you could simply change the bandpass region and 
output all of the above for a new frequency range (in MATLAB see 
fopen, fprintf, fclose). Your header file should have the 
coefficients in an array, be saved in a format like this: 

3 .007445391758509e- 003 and be data type Float. 

e Write a C function called Fir_filt that takes as its input one input 
time point and outputs one filtered time point. Use the coefficients 
designed above. Use a circular buffer to store the saved values of the 
input. 

e Write a program that filters one channel of the CODEC input and zeros 
the other channel. The data should be filtered with the function 
alee here. 


e Your function fir_filt should be located in a separate file from 
you main program. Also, create a header file to contain the declaration 
of your function fir_filt. 

e Your files should be organized something like this: 


1. DSK6713_audio.c - contains your main code for sending and 

receiving data and filtering the signal. 

2. bpf .h - contains your filter coefficients and a macro definition for the 
length of your filter (e.g. #define N 70). This file will be included 
in (pata ec. 

.fir_filt.c - contains your FIR filter function called fir_filt. 

4.fir_filt.h - contains the declaration of your FIR filter 

fir_filt. This will be included in the DSK6713_audio.c file. 


QJ 


Laboratory 


e Run the program on the DSK and demonstrate the filtered channel. 
When you build the project you may need to have the -o3 option 
selected. This will help optimize the code. If you are debugging the 
code you will need to set this back to None since the optimized code 
does not work well in debugging. 

e Use an input sinusoid and measure the amplitude of the output at 
different frequencies to test the frequency response. You could use a 
spectrum analyzer to get a power spectral density of your filter or you 
could use your laptop to analyze the system. 

e You should verify your spectrum that you determined in your pre-lab 
work. Use a good experimental write-up for this lab. 


